Computation and Language 89
☆ ModuleFormer: Learning Modular Large Language Models From Uncurated Data
Large Language Models (LLMs) have achieved remarkable results. But existing
models are expensive to train and deploy, and it is also difficult to expand
their knowledge beyond pre-training data without forgetting previous knowledge.
This paper proposes a new neural network architecture, ModuleFormer, that
leverages modularity to improve the efficiency and flexibility of large
language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE).
Unlike the previous SMoE-based modular language model [Gururangan et al.,
2021], which requires domain-labeled data to learn domain-specific experts,
ModuleFormer can induce modularity from uncurated data with its new load
balancing and load concentration losses. ModuleFormer is a modular architecture
that includes two different types of modules: new stick-breaking attention
heads and feedforward experts. Different modules are sparsely activated,
conditioned on the input token, during training and inference. In our experiment,
we found that the modular architecture enables three important abilities for
large pre-trained language models: 1) Efficiency, since ModuleFormer only
activates a subset of its modules for each input token, it can match the
performance of dense LLMs at more than twice the throughput; 2)
Extendability, ModuleFormer is more immune to catastrophic forgetting than
dense LLMs and can be easily extended with new modules to learn new knowledge
that is not included in the training data; 3) Specialisation, finetuning
ModuleFormer could specialize a subset of modules to the finetuning task, and
the task-unrelated modules could be easily pruned for a lightweight deployment.
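The sparse activation described above is characteristic of SMoE routing. As a rough illustration only (not ModuleFormer's exact formulation, and using a simple squared-deviation penalty rather than the paper's load balancing and load concentration losses), a minimal numpy sketch of top-k expert routing:

```python
import numpy as np

def top_k_route(logits, k=2):
    # logits: (tokens, experts) router scores; keep only the top-k experts
    # per token and softmax-normalize their gate weights.
    idx = np.argsort(-logits, axis=-1)[:, :k]
    rows = np.arange(logits.shape[0])[:, None]
    picked = np.exp(logits[rows, idx])
    gates = np.zeros_like(logits)
    gates[rows, idx] = picked / picked.sum(axis=-1, keepdims=True)
    return gates, idx

def load_balancing_loss(gates):
    # Penalize squared deviation of per-expert gate mass from uniform 1/E,
    # nudging the router to spread tokens across experts.
    usage = gates.mean(axis=0)
    uniform = 1.0 / gates.shape[1]
    return float(((usage - uniform) ** 2).sum())

rng = np.random.default_rng(0)
gates, idx = top_k_route(rng.normal(size=(8, 4)), k=2)
```

Because only k of E experts run per token, compute scales with k rather than E, which is the source of the throughput gain claimed above.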
☆ Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
Neural sequence models based on the transformer architecture have
demonstrated remarkable \emph{in-context learning} (ICL) abilities, where they
can perform new tasks when prompted with training and test examples, without
any parameter update to the model. This work first provides a comprehensive
statistical theory for transformers to perform ICL. Concretely, we show that
transformers can implement a broad class of standard machine learning
algorithms in context, such as least squares, ridge regression, Lasso, learning
generalized linear models, and gradient descent on two-layer neural networks,
with near-optimal predictive power on various in-context data distributions.
Using an efficient implementation of in-context gradient descent as the
underlying mechanism, our transformer constructions admit mild size bounds, and
can be learned with polynomially many pretraining sequences.
Building on these ``base'' ICL algorithms, intriguingly, we show that
transformers can implement more complex ICL procedures involving
\emph{in-context algorithm selection}, akin to what a statistician can do in
real life -- A \emph{single} transformer can adaptively select different base
ICL algorithms -- or even perform qualitatively different tasks -- on different
input sequences, without any explicit prompting of the right algorithm or task.
We establish this both in theory, via explicit constructions, and
experimentally. In theory, we construct two general mechanisms
for algorithm selection with concrete examples: pre-ICL testing, and post-ICL
validation. As an example, we use the post-ICL validation mechanism to
construct a transformer that can perform nearly Bayes-optimal ICL on a
challenging task -- noisy linear models with mixed noise levels.
Experimentally, we demonstrate the strong in-context algorithm selection
capabilities of standard transformer architectures.
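The post-ICL validation mechanism can be pictured outside a transformer: fit each base algorithm on a prefix of the in-context examples and select by held-out error. A hedged numpy sketch of that selection rule (the ridge-over-lambda setup and function names are illustrative, not the paper's construction):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def post_icl_validation(X, y, lams, holdout=0.3):
    # Fit each candidate "base algorithm" (here, ridge with a different
    # lambda) on a prefix of the in-context examples, validate on the
    # remainder, and keep the candidate with lowest held-out squared error.
    cut = int(len(y) * (1 - holdout))
    errs = [np.mean((X[cut:] @ ridge_fit(X[:cut], y[:cut], lam) - y[cut:]) ** 2)
            for lam in lams]
    best = lams[int(np.argmin(errs))]
    return best, ridge_fit(X, y, best)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                      # noiseless data favors no regularization
lam, w_hat = post_icl_validation(X, y, lams=[0.0, 100.0])
```

On noiseless data the unregularized fit wins the validation split; under mixed noise levels the same rule would pick different regularization per sequence, which is the adaptivity the abstract describes.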
☆ On the Reliability of Watermarks for Large Language Models
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein
Large language models (LLMs) are now deployed to everyday use and positioned
to produce large quantities of text in the coming decade. Machine-generated
text may displace human-written text on the internet and has the potential to
be used for malicious purposes, such as spearphishing attacks and social media
bots. Watermarking is a simple and effective strategy for mitigating such harms
by enabling the detection and documentation of LLM-generated text. Yet, a
crucial question remains: How reliable is watermarking in realistic settings in
the wild? There, watermarked text might be mixed with other text sources,
paraphrased by human writers or other language models, and used for
applications in a broad number of domains, both social and technical. In this
paper, we explore different detection schemes, quantify their power at
detecting watermarks, and determine how much machine-generated text needs to be
observed in each scenario to reliably detect the watermark. We especially
highlight our human study, where we investigate the reliability of watermarking
when faced with human paraphrasing. We compare watermark-based detection to
other detection strategies, finding overall that watermarking is a reliable
solution, especially because of its sample complexity: for all attacks we
consider, the watermark evidence compounds as more examples are observed, and
the watermark is eventually detected.
comment: 14 pages in the main body. Code is available at
https://github.com/jwkirchenbauer/lm-watermarking
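The detection setting can be made concrete with the standard greenlist z-score test from the authors' earlier watermarking work; this sketch substitutes a toy seeded partition for the real keyed pseudorandom function:

```python
import math
import random

def greenlist(prev_token, vocab_size, gamma=0.25):
    # Toy stand-in for the keyed pseudorandom partition: seed on the previous
    # token and mark a gamma-fraction of the vocabulary as "green".
    rng = random.Random(prev_token)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def watermark_z_score(tokens, vocab_size, gamma=0.25):
    # One-proportion z-test: without a watermark each token lands in the
    # greenlist with probability gamma, so the green count stays near
    # gamma * T; watermarked text pushes it far above.
    hits = sum(cur in greenlist(prev, vocab_size, gamma)
               for prev, cur in zip(tokens, tokens[1:]))
    T = len(tokens) - 1
    return (hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))

# Simulate watermarked text by always emitting a green token.
toks = [0]
for _ in range(100):
    toks.append(min(greenlist(toks[-1], vocab_size=1000)))
z = watermark_z_score(toks, vocab_size=1000)
```

The compounding evidence noted in the abstract falls out of the statistic: even if paraphrasing dilutes the green fraction, the z-score still grows with the square root of the observed length.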
☆ Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
Lifan Yuan, Yangyi Chen, Ganqu Cui, Hongcheng Gao, Fangyuan Zou, Xingyi Cheng, Heng Ji, Zhiyuan Liu, Maosong Sun
This paper reexamines the research on out-of-distribution (OOD) robustness in
the field of NLP. We find that the distribution shift settings in previous
studies commonly lack adequate challenges, hindering the accurate evaluation of
OOD robustness. To address these issues, we propose a benchmark construction
protocol that ensures clear differentiation and challenging distribution
shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution
robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we
conduct a series of experiments on pre-trained language models for analysis and
evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the
relationship between in-distribution (ID) and OOD performance. We identify
three typical relationship types that shed light on the inner learning
mechanism and could help forecast OOD robustness as performance on ID datasets
advances. Then, we evaluate 5 classic methods on BOSS and
find that, despite exhibiting some effectiveness in specific cases, they do not
offer significant improvement compared to vanilla fine-tuning. Further, we
evaluate 5 LLMs with various adaptation paradigms and find that when sufficient
ID data is available, fine-tuned domain-specific models significantly
outperform LLMs on ID examples. However, in the case of OOD instances, prioritizing
LLMs with in-context learning yields better results. We identify that both
fine-tuned small models and LLMs face challenges in effectively addressing
downstream tasks. The code is public at
\url{https://github.com/lifan-yuan/OOD_NLP}.
comment: Code is available at \url{https://github.com/lifan-yuan/OOD_NLP}
☆ The Two Word Test: A Semantic Benchmark for Large Language Models NeurIPS 2023
Large Language Models (LLMs) have shown remarkable abilities recently,
including passing advanced professional exams and demanding benchmark tests.
This performance has led many to suggest that they are close to achieving
humanlike or 'true' understanding of language, and even Artificial General
Intelligence (AGI). Here, we provide a new open-source benchmark, the Two Word
Test (TWT), that assesses the semantic abilities of LLMs on two-word phrases
via a task that humans can perform relatively easily without advanced training.
Combining multiple words into a single concept is a fundamental aspect of human
language and intelligence. The test requires meaningfulness judgments of 1768
noun-noun combinations that have been rated by 150 human raters as meaningful
(e.g., baby boy) or not meaningful (e.g., goat sky). We provide versions of the
task that probe meaningfulness ratings on a 0-4 scale as well as binary
judgments. We conducted a series of experiments using the TWT on GPT-4,
GPT-3.5, and Bard, with both versions. Results demonstrated that, compared to
humans, all models perform poorly at rating meaningfulness of these phrases.
GPT-3.5 and Bard are also unable to reliably make binary discriminations
between sensible and nonsense phrases. GPT-4 makes a substantial
improvement in binary discrimination of combinatorial phrases but is still
significantly worse than human performance. The TWT can be used to understand
the limitations and weaknesses of current LLMs, and potentially improve them.
The test also reminds us that caution is warranted in attributing 'true
understanding' or AGI to LLMs. TWT is available at:
https://github.com/NickRiccardi/two-word-test
comment: 12 pages, 5 figures, 3 tables, submitted to NeurIPS 2023 Datasets and
Benchmarks Track
☆ Language Models Get a Gender Makeover: Mitigating Gender Bias with Few-Shot Data Interventions ACL 2023
Societal biases present in pre-trained large language models are a critical
issue as these models have been shown to propagate biases in countless
downstream applications, rendering them unfair towards specific groups of
people. Since large-scale retraining of these models from scratch is both time
and compute-expensive, a variety of approaches have been previously proposed
that de-bias a pre-trained model. While the majority of current
state-of-the-art debiasing methods focus on changes to the training regime, in
this paper, we propose data intervention strategies as a powerful yet simple
technique to reduce gender bias in pre-trained models. Specifically, we
empirically show that by fine-tuning a pre-trained model on only 10 de-biased
(intervened) training examples, the tendency to favor any gender is
significantly reduced. Since our proposed method only needs a few training
examples, our few-shot debiasing approach is highly feasible and practical.
Through extensive experimentation, we show that our debiasing technique
performs better than competitive state-of-the-art baselines with minimal loss
in language modeling ability.
comment: Accepted to ACL 2023 Main Conference
☆ Gender, names and other mysteries: Towards the ambiguous for gender-inclusive translation
The vast majority of work on gender in MT focuses on 'unambiguous' inputs,
where gender markers in the source language are expected to be resolved in the
output. Conversely, this paper explores the widespread case where the source
sentence lacks explicit gender markers, but the target sentence contains them
due to richer grammatical gender. We particularly focus on inputs containing
person names.
Investigating such sentence pairs casts a new light on research into MT
gender bias and its mitigation. We find that many name-gender co-occurrences in
MT data are not resolvable with 'unambiguous gender' in the source language,
and that gender-ambiguous examples can make up a large proportion of training
examples. From this, we discuss potential steps toward gender-inclusive
translation which accepts the ambiguity in both gender and translation.
comment: GITT workshop at EAMT 2023
☆ ChatGPT is fun, but it is not funny! Humor is still challenging Large Language Models
Humor is a central aspect of human communication that has not been solved for
artificial agents so far. Large language models (LLMs) are increasingly able to
capture implicit and contextual information. Especially, OpenAI's ChatGPT
recently gained immense public attention. The GPT3-based model almost seems to
communicate on a human level and can even tell jokes. But is ChatGPT really
funny? We put ChatGPT's
sense of humor to the test. In a series of exploratory experiments around
jokes, i.e., generation, explanation, and detection, we seek to understand
ChatGPT's capability to grasp and reproduce human humor. Since the model itself
is not accessible, we applied prompt-based experiments. Our empirical evidence
indicates that the jokes are not hard-coded, yet are mostly not newly generated
by the model either: over 90% of the 1008 generated jokes were the same 25
jokes. The system
accurately explains valid jokes but also comes up with fictional explanations
for invalid jokes. Joke-typical characteristics can mislead ChatGPT in the
classification of jokes. ChatGPT has not solved computational humor yet but it
can be a big leap toward "funny" machines.
☆ Multi-Task Training with In-Domain Language Models for Diagnostic Reasoning
Generative artificial intelligence (AI) is a promising direction for
augmenting clinical diagnostic decision support and reducing diagnostic errors,
a leading contributor to medical errors. To further the development of clinical
AI systems, the Diagnostic Reasoning Benchmark (DR.BENCH) was introduced as a
comprehensive generative AI framework comprising six tasks representing key
components in clinical reasoning. We present a comparative analysis of
in-domain versus out-of-domain language models as well as multi-task versus
single task training with a focus on the problem summarization task in DR.BENCH
(Gao et al., 2023). We demonstrate that a multi-task, clinically trained
language model outperforms its general domain counterpart by a large margin,
establishing a new state-of-the-art performance, with a ROUGE-L score of 28.55.
This research underscores the value of domain-specific training for optimizing
clinical diagnostic reasoning tasks.
comment: Accepted to 2023 ClinicalNLP Workshop
☆ Contrastive Bootstrapping for Label Refinement ACL 2023
Traditional text classification typically categorizes texts into pre-defined
coarse-grained classes, so the resulting models cannot handle the real-world
scenario where finer-grained categories periodically emerge and demand more
accurate service. In this work, we investigate the setting where fine-grained
classification is done only using the annotation of coarse-grained categories
and the coarse-to-fine mapping. We propose a lightweight contrastive
clustering-based bootstrapping method to iteratively refine the labels of
passages. During clustering, it pulls away negative passage-prototype pairs
under the guidance of the mapping from both global and local perspectives.
Experiments on NYT and 20News show that our method outperforms the
state-of-the-art methods by a large margin.
comment: ACL 2023
☆ Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications
Paul Pu Liang, Chun Kai Ling, Yun Cheng, Alex Obolenskiy, Yudong Liu, Rohan Pandey, Alex Wilf, Louis-Philippe Morency, Ruslan Salakhutdinov
In many machine learning systems that jointly learn from multiple modalities,
a core research question is to understand the nature of multimodal
interactions: the emergence of new task-relevant information during learning
from both modalities that was not present in either alone. We study this
challenge of interaction quantification in a semi-supervised setting with only
labeled unimodal data and naturally co-occurring multimodal data (e.g.,
unlabeled images and captions, video and corresponding audio) but when labeling
them is time-consuming. Using a precise information-theoretic definition of
interactions, our key contributions are the derivations of lower and upper
bounds to quantify the amount of multimodal interactions in this
semi-supervised setting. We propose two lower bounds based on the amount of
shared information between modalities and the disagreement between separately
trained unimodal classifiers, and derive an upper bound through connections to
approximate algorithms for min-entropy couplings. We validate these estimated
bounds and show how they accurately track true interactions. Finally, two
semi-supervised multimodal applications are explored based on these theoretical
results: (1) analyzing the relationship between multimodal performance and
estimated interactions, and (2) self-supervised learning that embraces
disagreement between modalities rather than only agreement, as is typically done.
comment: Code available at: https://github.com/pliang279/PID
☆ Long-form analogies generated by chatGPT lack human-like psycholinguistic properties
Psycholinguistic analyses provide a means of evaluating large language model
(LLM) output and making systematic comparisons to human-generated text. These
methods can be used to characterize the psycholinguistic properties of LLM
output and illustrate areas where LLMs fall short in comparison to
human-generated text. In this work, we apply psycholinguistic methods to
evaluate individual sentences from long-form analogies about biochemical
concepts. We compare analogies generated by human subjects enrolled in
introductory biochemistry courses to analogies generated by chatGPT. We perform
a supervised classification analysis using 78 features extracted from
Coh-Metrix that analyze text cohesion, language, and readability (Graesser et
al., 2004). Results illustrate high performance for classifying
student-generated and chatGPT-generated analogies. To evaluate which features
contribute most to model performance, we use a hierarchical clustering
approach. Results from this analysis illustrate several linguistic differences
between the two sources.
comment: arxiv version of conference paper to appear at CogSci 2023 conference
☆ PromptAttack: Probing Dialogue State Trackers with Adversarial Prompts ACL 2023
A key component of modern conversational systems is the Dialogue State
Tracker (or DST), which models a user's goals and needs. Toward building more
robust and reliable DSTs, we introduce a prompt-based learning approach to
automatically generate effective adversarial examples to probe DST models. Two
key characteristics of this approach are: (i) it only needs the output of the
DST with no need for model parameters, and (ii) it can learn to generate
natural language utterances that can target any DST. Through experiments over
state-of-the-art DSTs, the proposed framework leads to the greatest reduction
in accuracy and the best attack success rate while maintaining good fluency and
a low perturbation ratio. We also show how much the generated adversarial
examples can bolster a DST through adversarial training. These results indicate
the strength of prompt-based attacks on DSTs and leave open avenues for
continued refinement.
comment: To appear in Findings of ACL 2023
☆ Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency ACL
Word error rate (WER) and character error rate (CER) are standard metrics in
Automatic Speech Recognition (ASR), but one problem has always been alternative
spellings: If one's system transcribes adviser whereas the ground truth has
advisor, this will count as an error even though the two spellings really
represent the same word.
Japanese is notorious for ``lacking orthography'': most words can be spelled
in multiple ways, presenting a problem for accurate ASR evaluation. In this
paper we propose a new lenient evaluation metric as a more defensible CER
measure for Japanese ASR. We create a lattice of plausible respellings of the
reference transcription, using a combination of lexical resources, a Japanese
text-processing system, and a neural machine translation model for
reconstructing kanji from hiragana or katakana. In a manual evaluation, raters
rated 95.4% of the proposed spelling variants as plausible. ASR results show
that our method, which does not penalize the system for choosing a valid
alternate spelling of a word, affords a 2.4%-3.1% absolute reduction in CER
depending on the task.
comment: ACL Workshop on Computation and Written Language (CAWL) 2023
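The lenient metric amounts to scoring a hypothesis against the closest member of the respelling lattice. A minimal sketch of that idea, with a plain edit distance over a flat list of variants rather than the paper's full lattice machinery:

```python
def edit_distance(a, b):
    # Levenshtein distance with a rolling one-row DP table.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (a[i - 1] != b[j - 1]))
    return dp[-1]

def lenient_cer(hypothesis, reference_variants):
    # Character error rate against the *closest* plausible respelling, so a
    # valid alternate spelling of a word is not counted as an error.
    return min(edit_distance(hypothesis, ref) / len(ref)
               for ref in reference_variants)
```

With the adviser/advisor example from the abstract, the strict CER charges one substitution while the lenient score is zero once the variant is in the reference set.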
☆ Can current NLI systems handle German word order? Investigating language model performance on a new German challenge set of minimal pairs
Compared to English, German word order is freer and therefore poses
additional challenges for natural language inference (NLI). We create WOGLI
(Word Order in German Language Inference), the first adversarial NLI dataset
for German word order that has the following properties: (i) each premise has
an entailed and a non-entailed hypothesis; (ii) premise and hypotheses differ
only in word order and necessary morphological changes to mark case and number.
In particular, each premise and its two hypotheses contain exactly the same
lemmata. Our adversarial examples require the model to use morphological
markers in order to recognise or reject entailment. We show that current German
autoencoding models fine-tuned on translated NLI data can struggle on this
challenge set, reflecting the fact that translated NLI datasets will not mirror
all necessary language phenomena in the target language. We also examine
performance after data augmentation as well as on related word order phenomena
derived from WOGLI. Our datasets are publicly available at
https://github.com/ireinig/wogli.
☆ Enhancing In-Context Learning with Answer Feedback for Multi-Span Question Answering NLPCC 2023
Whereas the recent emergence of large language models (LLMs) like ChatGPT has
exhibited impressive general performance, it still has a large gap with
fully-supervised models on specific tasks such as multi-span question
answering. Previous research has found that in-context learning is an effective
approach to exploiting LLMs, using a few task-related labeled examples as
demonstration examples to construct a few-shot prompt for answering new
questions. A popular implementation is to concatenate a few questions and their
correct answers through simple templates, informing LLM of the desired output.
In this paper, we propose a novel way of employing labeled data such that it
also informs LLM of some undesired output, by extending demonstration examples
with feedback about answers predicted by an off-the-shelf model, e.g., correct,
incorrect, or incomplete. Experiments on three multi-span question answering
datasets as well as a keyphrase extraction dataset show that our new prompting
strategy consistently improves LLM's in-context learning performance.
comment: 12 pages, submitted to NLPCC 2023
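The prompting idea can be sketched as a template in which each demonstration also carries an off-the-shelf model's prediction plus a feedback tag; the field names and wording below are hypothetical, not the paper's exact template:

```python
def build_feedback_prompt(demos, question):
    # Each demo shows the question, a candidate answer from an off-the-shelf
    # model tagged as correct / incorrect / incomplete, and the gold answer,
    # so the LLM sees undesired outputs as well as desired ones.
    parts = []
    for d in demos:
        parts.append(f"Question: {d['q']}\n"
                     f"Candidate answer: {d['pred']} [{d['feedback']}]\n"
                     f"Answer: {d['gold']}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

demo = [{"q": "Which cities are mentioned?", "pred": "Paris",
         "feedback": "incomplete", "gold": "Paris; Lyon"}]
prompt = build_feedback_prompt(demo, "Which rivers are mentioned?")
```

The contrast with the plain template is only the tagged candidate-answer line, which is what informs the LLM of undesired outputs.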
☆ Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers ACL 2023
ChatGPT is a large language model developed by OpenAI. Despite its impressive
performance across various tasks, no prior work has investigated its capability
in the biomedical domain yet. To this end, this paper aims to evaluate the
performance of ChatGPT on various benchmark biomedical tasks, such as relation
extraction, document classification, question answering, and summarization. To
the best of our knowledge, this is the first work that conducts an extensive
evaluation of ChatGPT in the biomedical domain. Interestingly, our evaluation
finds that on biomedical datasets with smaller training sets, zero-shot ChatGPT
even outperforms the state-of-the-art fine-tuned generative
transformer models, such as BioGPT and BioBART. This suggests that ChatGPT's
pre-training on large text corpora makes it effective even in a specialized
domain such as biomedicine. Our findings demonstrate that ChatGPT has the potential to
be a valuable tool for various tasks in the biomedical domain that lack large
annotated data.
comment: Accepted by BioNLP@ACL 2023
☆ STEPS: A Benchmark for Order Reasoning in Sequential Tasks
Various human activities can be abstracted into a sequence of actions in
natural text, i.e. cooking, repairing, manufacturing, etc. Such action
sequences heavily depend on the executing order, while disorder in action
sequences leads to failure of further task execution by robots or AI agents.
Therefore, to verify the order reasoning capability of current neural models in
sequential tasks, we propose a challenging benchmark named STEPS. STEPS
involves two subtask settings: determining the rationality of a given next step
in a recipe, and selecting the reasonable step from a multi-choice question.
We describe the data construction and task
formulations, and benchmark most of the significant Large Language Models (LLMs).
The experimental results demonstrate that 1) the commonsense reasoning of
action order in sequential tasks is challenging for LLMs to resolve via
zero-shot prompting or few-shot in-context learning; and 2) prompting methods
still significantly lag behind tuning-based methods on STEPS.
comment: Work in Progress
☆ Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages INTERSPEECH 2023
Claytone Sikasote, Kalinda Siaminwe, Stanly Mwape, Bangiwe Zulu, Mofya Phiri, Martin Phiri, David Zulu, Mayumbo Nyirenda, Antonios Anastasopoulos
This work introduces Zambezi Voice, an open-source multilingual speech
resource for Zambian languages. It contains two collections of datasets:
unlabelled audio recordings of radio news and talk show programs (160 hours)
and labelled data (over 80 hours) consisting of read speech recorded from text
sourced from publicly available literature books. The dataset is created for
speech recognition but can be extended to multilingual speech processing
research for both supervised and unsupervised learning approaches. To our
knowledge, this is the first multilingual speech dataset created for Zambian
languages. We exploit pretraining and cross-lingual transfer learning by
finetuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build
end-to-end (E2E) speech recognition models for our baseline models. The dataset
is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be
accessed through the project repository. See
https://github.com/unza-speech-lab/zambezi-voice
comment: Accepted at INTERSPEECH 2023
☆ Examining Bias in Opinion Summarisation Through the Perspective of Opinion Diversity WASSA
Opinion summarisation is a task that aims to condense the information
presented in the source documents while retaining the core message and
opinions. A summary that only represents the majority opinions will leave the
minority opinions unrepresented in the summary. In this paper, we use the
stance towards a certain target as an opinion. We study bias in opinion
summarisation from the perspective of opinion diversity, which measures whether
the model generated summary can cover a diverse set of opinions. In addition,
we examine opinion similarity, a measure of how closely related two opinions
are in terms of their stance on a given topic, and its relationship with
opinion diversity. Through the lens of stances towards a topic, we examine
opinion diversity and similarity using three debatable topics under COVID-19.
Experimental results on these topics revealed that a higher degree of opinion
similarity did not indicate good diversity or fair coverage of the various
opinions originally presented in the source documents. We found that
BART and ChatGPT can better capture diverse opinions presented in the source
documents.
comment: 9 pages, 3 figures, accepted at WASSA, ACL 2023
☆ Transfer Learning of Transformer-based Speech Recognition Models from Czech to Slovak
In this paper, we are comparing several methods of training the Slovak speech
recognition models based on the Transformers architecture. Specifically, we are
exploring the approach of transfer learning from the existing Czech pre-trained
Wav2Vec 2.0 model into Slovak. We are demonstrating the benefits of the
proposed approach on three Slovak datasets. Our Slovak models scored the best
results when initializing the weights from the Czech model at the beginning of
the pre-training phase. Our results show that the knowledge stored in the Czech
pre-trained model can be successfully reused to solve tasks in Slovak while
outperforming even much larger public multilingual models.
comment: Accepted to TSD 2023
☆ M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu
Instruction tuning has significantly advanced large language models (LLMs)
such as ChatGPT, enabling them to align with human instructions across diverse
tasks. However, progress in open vision-language models (VLMs) has been limited
due to the scarcity of high-quality instruction datasets. To tackle this
challenge and promote research in the vision-language field, we introduce the
Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to
optimize VLM alignment with human instructions. Our M$^3$IT dataset comprises
40 carefully curated datasets, including 2.4 million instances and 400 manually
written task instructions, reformatted into a vision-to-text structure. Key
tasks are translated into 80 languages with an advanced translation system,
ensuring broader accessibility. M$^3$IT surpasses previous datasets regarding
task coverage, instruction number and instance scale. Moreover, we develop
Ying-VLM, a VLM model trained on our M$^3$IT dataset, showcasing its potential
to answer complex questions requiring world knowledge, generalize to unseen
video tasks, and comprehend unseen instructions in Chinese. To encourage
further research, we have open-sourced both the dataset and trained models.
comment: Dataset available at: https://huggingface.co/MMInstruction/M3IT
☆ Multilingual Clinical NER: Translation or Cross-lingual Transfer?
Natural language tasks like Named Entity Recognition (NER) in the clinical
domain on non-English texts can be very time-consuming and expensive due to the
lack of annotated data. Cross-lingual transfer (CLT) is a way to circumvent
this issue thanks to the ability of multilingual large language models to be
fine-tuned on a specific task in one language and to provide high accuracy for
the same task in another language. However, other methods leveraging
translation models can be used to perform NER without annotated data in the
target language, by either translating the training set or test set. This paper
compares cross-lingual transfer with these two alternative methods, to perform
clinical NER in French and in German without any training data in those
languages. To this end, we release MedNERF a medical NER test set extracted
from French drug prescriptions and annotated with the same guidelines as an
English dataset. Through extensive experiments on this dataset and on a German
medical dataset (Frei and Kramer, 2021), we show that translation-based methods
can achieve similar performance to CLT but require more care in their design.
And while they can take advantage of monolingual clinical language models,
those do not guarantee better results than large general-purpose multilingual
models, whether with cross-lingual transfer or translation.
comment: 23 pages, Proceedings of the 5th Clinical Natural Language Processing
Workshop
☆ Label Aware Speech Representation Learning For Language Identification
Shikhar Vashishth, Shikhar Bharadwaj, Sriram Ganapathy, Ankur Bapna, Min Ma, Wei Han, Vera Axelrod, Partha Talukdar
Speech representation learning approaches for non-semantic tasks such as
language recognition have either explored supervised embedding extraction
methods using a classifier model or self-supervised representation learning
approaches using raw data. In this paper, we propose a novel framework of
combining self-supervised representation learning with the language label
information for the pre-training task. This framework, termed as Label Aware
Speech Representation (LASR) learning, uses a triplet based objective function
to incorporate language labels along with the self-supervised loss function.
The speech representations are further fine-tuned for the downstream task. The
language recognition experiments are performed on two public datasets - FLEURS
and Dhwani. In these experiments, we illustrate that the proposed LASR
framework improves over the state-of-the-art systems on language
identification. We also report an analysis of the robustness of the LASR approach
to noisy/missing labels as well as its application to multi-lingual speech
recognition tasks.
comment: Accepted at Interspeech 2023
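The label-aware objective sketched in the abstract, a self-supervised loss combined with a triplet term over language labels, can be illustrated as below. This is a minimal sketch assuming cosine similarity, a margin of 0.2, and a fixed loss weight; the paper's exact formulation may differ.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Pull same-language pairs together, push different-language pairs apart.
    return max(0.0, cosine(anchor, negative) - cosine(anchor, positive) + margin)

def lasr_loss(ssl_loss, anchor, positive, negative, weight=0.5):
    # Total objective: self-supervised loss plus a weighted label-aware
    # triplet term (weight is an illustrative hyperparameter).
    return ssl_loss + weight * triplet_loss(anchor, positive, negative)
```

Here `anchor`/`positive` would be embeddings of utterances sharing a language label and `negative` an utterance from a different language.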
☆ Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation
Despite major advancements in Automatic Speech Recognition (ASR),
state-of-the-art ASR systems struggle to deal with impaired speech even in
high-resource languages. In Arabic, this challenge is amplified by the added
complexity of collecting data from dysarthric speakers. In this paper, we aim
to improve the performance of Arabic dysarthric automatic speech recognition
through a multi-stage augmentation approach. To this effect, we first propose a
signal-based approach to generate dysarthric Arabic speech from healthy Arabic
speech by modifying its speed and tempo. We also propose a second stage
Parallel Wave Generative (PWG) adversarial model that is trained on an English
dysarthric dataset to capture language-independent dysarthric speech patterns
and further augment the signal-adjusted speech samples. Furthermore, we propose
fine-tuning and text-correction strategies for the Arabic Conformer at different
dysarthric speech severity levels. Our fine-tuned Conformer achieved 18% Word
Error Rate (WER) and 17.2% Character Error Rate (CER) on synthetically
generated dysarthric speech from the Arabic Common Voice speech dataset. This
shows a significant WER improvement of 81.8% compared to the baseline model
trained solely on healthy data. We perform further validation on real English
dysarthric speech showing a WER improvement of 124% compared to the baseline
trained only on healthy English LJSpeech dataset.
comment: Accepted to Interspeech 2023
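A crude stand-in for the signal-based speed/tempo perturbation described above can be written with plain linear-interpolation resampling; the stretch factors and interpolation scheme here are illustrative assumptions, not the paper's method.

```python
def resample(samples, factor):
    # Time-stretch a waveform by `factor` (>1 slows, <1 speeds up)
    # using simple linear interpolation between adjacent samples.
    n_out = int(len(samples) * factor)
    out = []
    for i in range(n_out):
        pos = i / factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def augment(samples, factors=(0.8, 0.9, 1.1, 1.2)):
    # One slowed/sped copy per factor, mimicking atypical speech rates.
    return [resample(samples, f) for f in factors]
```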
☆ Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks
Haiyang Xu, Qinghao Ye, Xuan Wu, Ming Yan, Yuan Miao, Jiabo Ye, Guohai Xu, Anwen Hu, Yaya Shi, Guangwei Xu, Chenliang Li, Qi Qian, Maofei Que, Ji Zhang, Xiao Zeng, Fei Huang
To promote the development of Vision-Language Pre-training (VLP) and
multimodal Large Language Models (LLMs) in the Chinese community, we release
Youku-mPLUG, the largest public high-quality Chinese video-language dataset,
collected from Youku, a well-known Chinese video-sharing
website, with strict criteria of safety, diversity, and quality. Youku-mPLUG
contains 10 million Chinese video-text pairs filtered from 400 million raw
videos across a wide range of 45 diverse categories for large-scale
pre-training. In addition, to facilitate a comprehensive evaluation of
video-language models, we carefully build the largest human-annotated Chinese
benchmarks covering three popular video-language tasks of cross-modal
retrieval, video captioning, and video category classification. Youku-mPLUG can
enable researchers to conduct more in-depth multimodal research and develop
better applications in the future. Furthermore, we release popular
video-language pre-training models, ALPRO and mPLUG-2, and our proposed
modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG.
Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1%
improvement in video category classification. Moreover, mPLUG-video achieves
new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in
video category classification and a 68.9 CIDEr score in video captioning.
Finally, we scale up mPLUG-video based on the frozen Bloomz with
only 1.7% trainable parameters as a Chinese multimodal LLM, and demonstrate
impressive instruction and video understanding ability. The zero-shot
instruction understanding experiment indicates that pretraining with
Youku-mPLUG can enhance the ability to comprehend overall and detailed visual
semantics, recognize scene text, and leverage open-domain knowledge.
comment: Work in progress
☆ ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems
Dialogue response selection aims to select an appropriate response from
several candidates based on a given user and system utterance history. Recent
studies have been improving the accuracy of dialogue response selection through
post-training, mostly relying on naive masked language modeling methods.
However, the recently developed generative methods have shown promising text
representation capabilities in the IR community, which could potentially lead to
better dialogue semantics modeling. Thus, in this paper, we propose Dial-MAE
(Dialogue Contextual Masking Auto-encoder), a straightforward yet effective
post-training technique tailored for dialogue response selection. Dial-MAE uses
an asymmetric encoder-decoder architecture that learns to better compress the
semantics of the dialogue into dense dialogue vectors. The process of Dial-MAE
involves a deep encoder creating a dialogue embedding with the masked dialogue
context, followed by a shallow decoder that uses this embedding along with the
highly masked response to restore the original response. Our experiments have
demonstrated that Dial-MAE is highly effective, achieving state-of-the-art
performance on two commonly evaluated benchmarks.
☆ GPT Self-Supervision for a Better Data Annotator
The task of annotating data into concise summaries poses a significant
challenge across various domains, frequently requiring the allocation of
significant time and specialized knowledge by human experts. Despite existing
efforts to use large language models for annotation tasks, significant problems
such as limited applicability to unlabeled data, the absence of self-supervised
methods, and the lack of focus on complex structured data still persist. In
this work, we propose a GPT self-supervision annotation method. This method
embodies a generating-recovering paradigm that leverages the one-shot learning
capabilities of the Generative Pretrained Transformer (GPT). The
proposed approach comprises a one-shot tuning phase followed by a generation
phase. In the one-shot tuning phase, we sample a data point from the support set as
part of the prompt for GPT to generate a textual summary, which is then used to
recover the original data. The alignment score between the recovered and
original data serves as a self-supervision navigator to refine the process. In
the generation stage, the optimally selected one-shot sample serves as a
template in the prompt and is applied to generate summaries from challenging
datasets. The annotation performance is evaluated by tuning several human
feedback reward networks and by calculating alignment scores between original
and recovered data at both sentence and structure levels. Our self-supervised
annotation method consistently achieves competitive scores, demonstrating its
robustness in various data-to-summary annotation tasks.
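The generating-recovering selection loop can be sketched as follows; `summarize` and `recover` are hypothetical stand-ins for the GPT calls, and the token-overlap F1 is only a simple proxy for the paper's alignment score.

```python
def alignment_score(original, recovered):
    # Token-level F1 between original and recovered data: a simple proxy
    # for the sentence/structure-level alignment described above.
    o, r = original.split(), recovered.split()
    common = len(set(o) & set(r))
    if common == 0:
        return 0.0
    p, rec = common / len(r), common / len(o)
    return 2 * p * rec / (p + rec)

def select_one_shot(support_set, summarize, recover):
    # Try each candidate as the one-shot example; keep the one whose
    # summary best recovers the original data.
    best, best_score = None, -1.0
    for sample in support_set:
        summary = summarize(sample)
        restored = recover(summary)
        score = alignment_score(sample, restored)
        if score > best_score:
            best, best_score = sample, score
    return best, best_score
```

The selected sample would then serve as the prompt template in the generation stage.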
☆ World Models for Math Story Problems ACL
Solving math story problems is a complex task for students and NLP models
alike, requiring them to understand the world as described in the story and
reason over it to compute an answer. Recent years have seen impressive
performance on automatically solving these problems with large pre-trained
language models and innovative techniques to prompt them. However, it remains
unclear if these models possess accurate representations of mathematical
concepts. This leads to a lack of interpretability and trustworthiness, which
impedes their usefulness in various applications. In this paper, we consolidate
previous work on categorizing and representing math story problems and develop
MathWorld, a graph-based semantic formalism specific to the domain of
math story problems. With MathWorld, we can assign world models to math story
problems which represent the situations and actions introduced in the text and
their mathematical relationships. We combine math story problems from several
existing datasets and annotate a corpus of 1,019 problems and 3,204 logical
forms with MathWorld. Using this data, we demonstrate the following use cases
of MathWorld: (1) prompting language models with synthetically generated
question-answer pairs to probe their reasoning and world modeling abilities,
and (2) generating new problems by using the world models as a design space.
comment: ACL Findings 2023
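As a toy illustration of the world-model idea (not the MathWorld formalism itself), a story problem can be represented as containers and transfer actions and then queried for the answer:

```python
class WorldModel:
    # Toy graph of quantities and transfer actions, loosely inspired by
    # representing the situations and actions introduced in a story.
    def __init__(self):
        self.state = {}

    def has(self, entity, amount):
        # "Alice has 5 apples."
        self.state[entity] = amount

    def transfer(self, src, dst, amount):
        # "Alice gives Bob 2 apples."
        self.state[src] -= amount
        self.state[dst] = self.state.get(dst, 0) + amount

    def query(self, entity):
        # "How many apples does Alice have?"
        return self.state[entity]

# "Alice has 5 apples. She gives Bob 2. How many does Alice have?"
w = WorldModel()
w.has("alice", 5)
w.has("bob", 0)
w.transfer("alice", "bob", 2)
```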
☆ Co-evolving Graph Reasoning Network for Emotion-Cause Pair Extraction ECML-PKDD 2023
Emotion-Cause Pair Extraction (ECPE) aims to extract all emotion clauses and
their corresponding cause clauses from a document. Existing approaches tackle
this task through multi-task learning (MTL) framework in which the two subtasks
provide indicative clues for ECPE. However, the previous MTL framework
considers only one round of multi-task reasoning and ignores the reverse
feedback from ECPE to the subtasks. Besides, its multi-task reasoning only
relies on semantics-level interactions, which cannot capture the explicit
dependencies, and both encoder sharing and multi-task hidden-state
concatenation can hardly capture the causalities. To solve these issues, we
first put forward a new MTL framework based on Co-evolving Reasoning. It (1)
models the bidirectional feedback between ECPE and its subtasks; (2) allows
the three tasks to evolve together and prompt each other recurrently; (3)
integrates prediction-level interactions to capture explicit dependencies. Then
we propose a novel multi-task relational graph (MRG) to sufficiently exploit
the causal relations. Finally, we propose a Co-evolving Graph Reasoning Network
(CGR-Net) that implements our MTL framework and conducts Co-evolving Reasoning
on MRG. Experimental results show that our model achieves new state-of-the-art
performance, and further analysis confirms the advantages of our method.
comment: Accepted by ECML-PKDD 2023
☆ A Study on the Reliability of Automatic Dysarthric Speech Assessments
Automating dysarthria assessments offers the opportunity to develop
effective, low-cost tools that address the current limitations of manual and
subjective assessments. Nonetheless, it is unclear whether current approaches
rely on dysarthria-related speech patterns or external factors. We aim to
obtain a clearer understanding of dysarthria patterns. To this end, we
study the effects of noise in recordings, both through addition and reduction.
We design and implement a new method for visualizing and comparing feature
extractors and models, at a patient level, in a more interpretable way. We use
the UA-Speech dataset with a speaker-based split. Results reported in the
literature appear to have been obtained without such a split, leading to models
that may be overconfident due to data leakage. We hope that
these results raise awareness in the research community regarding the
requirements for establishing reliable automatic dysarthria assessment systems.
☆ Echoes from Alexandria: A Large Resource for Multilingual Book Summarization ACL 2023
In recent years, research in text summarization has mainly focused on the
news domain, where texts are typically short and have strong layout features.
The task of full-book summarization presents additional challenges which are
hard to tackle with current resources, due to their limited size and
availability in English only. To overcome these limitations, we present "Echoes
from Alexandria", or in shortened form, "Echoes", a large resource for
multilingual book summarization. Echoes features three novel datasets: i)
Echo-Wiki, for multilingual book summarization, ii) Echo-XSum, for
extremely-compressive multilingual book summarization, and iii) Echo-FairySum,
for extractive book summarization. To the best of our knowledge, Echoes, with
its thousands of books and summaries, is the largest resource, and the first to
be multilingual, featuring 5 languages and 25 language pairs. In addition to
Echoes, we also introduce a new extractive-then-abstractive baseline, and,
supported by our experimental results and manual analysis of the summaries
generated, we argue that this baseline is more suitable for book summarization
than purely-abstractive approaches. We release our resource and software at
https://github.com/Babelscape/echoes-from-alexandria in the hope of fostering
innovative research in multilingual book summarization.
comment: 9 pages, long paper at ACL 2023
☆ IUTEAM1 at MEDIQA-Chat 2023: Is simple fine tuning effective for multilayer summarization of clinical conversations?
Clinical conversation summarization has become an important application of
Natural Language Processing. In this work, we analyze summarization
model ensembling approaches that can be utilized to improve the overall
accuracy of the generated medical report, called a chart note. The work starts
with a single summarization model as the baseline, then moves to an
ensemble of summarization models, each trained on a separate section of the
chart note. The final approach passes the generated results to
another summarization model in a multi-layer/stage fashion for better coherency
of the generated text. Our results indicate that although an ensemble of models
specialized in each section produces better results, the multi-layer/stage
approach does not improve accuracy. The code for the above paper is available
at https://github.com/dhananjay-srivastava/MEDIQA-Chat-2023-iuteam1.git
comment: preprint
☆ Cross-Genre Argument Mining: Can Language Models Automatically Fill in Missing Discourse Markers?
Available corpora for Argument Mining differ along several axes, and one of
the key differences is the presence (or absence) of discourse markers to signal
argumentative content. Exploring effective ways to use discourse markers has
received wide attention in various discourse parsing tasks, from which it is
well-known that discourse markers are strong indicators of discourse relations.
To improve the robustness of Argument Mining systems across different genres,
we propose to automatically augment a given text with discourse markers such
that all relations are explicitly signaled. Our analysis unveils that popular
language models taken out-of-the-box fail on this task; however, when
fine-tuned on a new heterogeneous dataset that we construct (including
synthetic and real examples), they perform considerably better. We demonstrate
the impact of our approach on an Argument Mining downstream task, evaluated on
different corpora, showing that language models can be trained to automatically
fill in discourse markers across different corpora, improving the performance
of a downstream model in some, but not all, cases. Our proposed approach can
further be employed as an assistive tool for better discourse understanding.
☆ Personality testing of GPT-3: Limited temporal reliability, but highlighted social desirability of GPT-3's personality instruments results
To assess the potential applications and limitations of chatbot GPT-3
Davinci-003, this study explored the temporal reliability of personality
questionnaires applied to the chatbot and its personality profile.
Psychological questionnaires were administered to the chatbot on two separate
occasions, followed by a comparison of the responses to human normative data.
The findings revealed varying levels of agreement in the chatbot's responses
over time, with some scales displaying excellent while others demonstrated poor
agreement. Overall, Davinci-003 displayed a socially desirable and pro-social
personality profile, particularly in the domain of communion. However, the
underlying basis of the chatbot's responses, whether driven by conscious
self-reflection or predetermined algorithms, remains uncertain.
comment: 18 pages, 1 table
☆ Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes INTERSPEECH 2023
This paper proposes Allophant, a multilingual phoneme recognizer. It requires
only a phoneme inventory for cross-lingual transfer to a target language,
allowing for low-resource recognition. The architecture combines a
compositional phone embedding approach with individually supervised phonetic
attribute classifiers in a multi-task architecture. We also introduce
Allophoible, an extension of the PHOIBLE database. When combined with a
distance-based mapping approach for grapheme-to-phoneme outputs, it allows us
to train on PHOIBLE inventories directly. By training and evaluating on 34
languages, we found that the addition of multi-task learning improves the
model's ability to generalize to unseen phonemes and phoneme inventories.
On supervised languages we achieve phoneme error rate improvements of 11
percentage points (pp.) compared to a baseline without multi-task learning.
Evaluation of zero-shot transfer on 84 languages yielded a decrease in PER of
2.63 pp. over the baseline.
comment: 5 pages, 2 figures, 2 tables, accepted to INTERSPEECH 2023
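The distance-based mapping of predicted phonemes onto a target inventory can be sketched as below; the Hamming distance over binary articulatory feature vectors is an illustrative assumption, not necessarily the distance used by Allophant.

```python
def map_to_inventory(predicted, inventory, distance):
    # Map each predicted phoneme to the nearest phoneme in the target
    # language's inventory under the supplied distance function.
    return [min(inventory, key=lambda p: distance(pred, p)) for pred in predicted]

def feature_distance(a, b, features):
    # Hamming distance over binary articulatory feature vectors.
    return sum(x != y for x, y in zip(features[a], features[b]))
```

For example, a predicted /b/ absent from the target inventory would map to whichever inventory phoneme differs in the fewest articulatory features.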
☆ Phrase Retrieval for Open-Domain Conversational Question Answering with Conversational Dependency Modeling via Contrastive Learning ACL 2023
Open-Domain Conversational Question Answering (ODConvQA) aims at answering
questions through a multi-turn conversation based on a retriever-reader
pipeline, which retrieves passages and then predicts answers with them.
However, such a pipeline approach not only makes the reader vulnerable to the
errors propagated from the retriever, but also demands additional effort to
develop both the retriever and the reader, which further makes it slower since
they are not runnable in parallel. In this work, we propose a method to
directly predict answers with a phrase retrieval scheme for a sequence of
words, reducing the conventional two distinct subtasks into a single one. Also,
for the first time, we study its capability for ODConvQA tasks. However, simply
adopting it is largely problematic, due to the dependencies between previous
and current turns in a conversation. To address this problem, we further
introduce a novel contrastive learning strategy, making sure to reflect
previous turns when retrieving the phrase for the current context, by
maximizing representational similarities of consecutive turns in a conversation
while minimizing irrelevant conversational contexts. We validate our model on
two ODConvQA datasets, whose experimental results show that it substantially
outperforms the relevant baselines with the retriever-reader. Code is available
at: https://github.com/starsuzi/PRO-ConvQA.
comment: Findings of ACL 2023
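The contrastive idea above, maximizing similarity of consecutive turns while minimizing similarity to irrelevant conversational contexts, is commonly implemented as an InfoNCE-style loss; the sketch below works over precomputed similarities, and the temperature value is an illustrative assumption.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    # Cross-entropy of picking the positive (consecutive turn) against
    # the negatives (irrelevant contexts), computed stably in log space.
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

A high positive similarity relative to the negatives drives the loss toward zero.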
☆ Analysis of the Fed's communication by using textual entailment model of Zero-Shot classification
In this study, we analyze documents published by central banks using text
mining techniques and propose a method to evaluate the policy tone of central
banks. Since the monetary policies of major central banks have a broad impact
on financial market trends, the pricing of risky assets, and the real economy,
market participants are attempting to more accurately capture changes in the
outlook for central banks' future monetary policies. Since the published
documents are also an important tool for the central bank to communicate with
the market, they are meticulously crafted in their grammatical syntax and
wording, and investors are expected to read the central bank's policy stance
accurately. Sentiment analysis on central bank documents has long been carried out,
but it has been difficult to interpret the meaning of the documents accurately
and to explicitly capture even intentional changes in nuance. This study
evaluates how well a zero-shot text classification method generalizes to an
unknown economic environment using the same model. We compare the
tone of the statements, minutes, press conference transcripts of FOMC meetings,
and the Fed officials' (chair, vice chair, and Governors) speeches. In
addition, the minutes of the FOMC meetings were subjected to a phase analysis
of changes in each policy stance since 1971.
comment: 6 pages, 4 figures, 2 Tables
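Zero-shot classification via textual entailment typically scores each candidate label by how strongly the text entails a label-bearing hypothesis. A minimal sketch, with `entail_prob` standing in for a real NLI model and a hypothetical hypothesis template:

```python
def zero_shot_tone(text, labels, entail_prob):
    # Build one hypothesis per candidate label, score each with the
    # entailment model, and return the best label plus all scores.
    hypotheses = {l: f"The stance of this statement is {l}." for l in labels}
    scores = {l: entail_prob(text, h) for l, h in hypotheses.items()}
    return max(scores, key=scores.get), scores
```

In practice `entail_prob` would be the entailment probability from a pretrained NLI model; here it is an injected function.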
☆ Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features ISCA
Speaker diarization is the task of answering "Who spoke and when?" in an audio
stream. Pipeline systems rely on speech segmentation to extract speakers'
segments and achieve robust speaker diarization. This paper proposes a common
framework to solve three segmentation tasks in the distant speech scenario:
Voice Activity Detection (VAD), Overlapped Speech Detection (OSD), and Speaker
Change Detection (SCD). In the literature, a few studies investigate the
multi-microphone distant speech scenario. In this work, we propose a new set of
spatial features based on direction-of-arrival estimations in the circular
harmonic domain (CH-DOA). These spatial features are extracted from
multi-microphone audio data and combined with standard acoustic features.
Experiments on the AMI meeting corpus show that CH-DOA can improve the
segmentation while being robust in the case of deactivated microphones.
comment: Interspeech 2023, International Speech Communication Association
(ISCA), Aug 2023, Dublin, Ireland
☆ Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization
Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix
End-to-end speech summarization (E2E SSum) directly summarizes input speech
into easy-to-read short sentences with a single model. This approach is
promising because it, in contrast to the conventional cascade approach, can
utilize full acoustic information and mitigate the propagation of
transcription errors. However, due to the high cost of collecting
speech-summary pairs, an E2E SSum model tends to suffer from training data
scarcity and output unnatural sentences. To overcome this drawback, we propose
for the first time to integrate a pre-trained language model (LM), which is
highly capable of generating natural sentences, into the E2E SSum decoder via
transfer learning. In addition, to reduce the gap between the independently
pre-trained encoder and decoder, we also propose to transfer the baseline E2E
SSum encoder instead of the commonly used automatic speech recognition encoder.
Experimental results show that the proposed model outperforms the baseline and
data-augmented models.
comment: Accepted by Interspeech 2023
☆ Effective Neural Topic Modeling with Embedding Clustering Regularization ICML 2023
Topic models have been prevalent for decades with various applications.
However, existing topic models commonly suffer from the notorious problem of
topic collapsing: discovered topics semantically collapse towards each other, leading
to highly repetitive topics, insufficient topic discovery, and damaged model
interpretability. In this paper, we propose a new neural topic model, Embedding
Clustering Regularization Topic Model (ECRTM). Besides the existing
reconstruction error, we propose a novel Embedding Clustering Regularization
(ECR), which forces each topic embedding to be the center of a separately
aggregated word embedding cluster in the semantic space. This enables each
produced topic to contain distinct word semantics, which alleviates topic
collapsing. Regularized by ECR, our ECRTM generates diverse and coherent topics
together with high-quality topic distributions of documents. Extensive
experiments on benchmark datasets demonstrate that ECRTM effectively addresses
the topic collapsing issue and consistently surpasses state-of-the-art
baselines in terms of topic quality, topic distributions of documents, and
downstream classification tasks.
comment: Accepted to ICML 2023 conference
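A simplified nearest-centre sketch of the clustering-regularization idea: penalize each topic embedding's distance to the centroid of the word embeddings assigned to it. The assignment rule and squared-distance penalty are illustrative assumptions; the paper's exact ECR formulation may differ.

```python
def ecr_penalty(topic_embs, word_embs):
    # Assign each word embedding to its nearest topic embedding, then
    # penalize each topic's squared distance to the centroid of its
    # assigned words, pushing topics toward distinct word clusters.
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    clusters = {i: [] for i in range(len(topic_embs))}
    for w in word_embs:
        i = min(clusters, key=lambda k: sqdist(w, topic_embs[k]))
        clusters[i].append(w)

    penalty = 0.0
    for i, words in clusters.items():
        if not words:
            continue
        centroid = [sum(c) / len(words) for c in zip(*words)]
        penalty += sqdist(topic_embs[i], centroid)
    return penalty
```

This term would be added to the reconstruction loss during training; a zero penalty means every topic already sits at the centre of its own word cluster.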
☆ Leveraging Knowledge Graph Embeddings to Enhance Contextual Representations for Relation Extraction
Relation extraction is a crucial and challenging task in Natural
Language Processing. Several methods have recently emerged, exhibiting
notable performance in addressing the task; however, most of these approaches
rely on vast amounts of data from large-scale knowledge graphs or language
models pretrained on voluminous corpora. In this paper, we hone in on the
effective utilization of solely the knowledge supplied by a corpus to create a
high-performing model. Our objective is to showcase that by leveraging the
hierarchical structure and relational distribution of entities within a corpus
without introducing external knowledge, a relation extraction model can achieve
significantly enhanced performance. We therefore propose a relation extraction
approach based on the incorporation of pretrained knowledge graph embeddings at
the corpus scale into the sentence-level contextual representation. We
conducted a series of experiments which revealed promising results for our
proposed approach. The results demonstrate that our method outperforms
context-based relation extraction models.
comment: 15 pages, 1 figure, The 17th International Conference on Document
Analysis and Recognition
☆ An ASR-Based Tutor for Learning to Read: How to Optimize Feedback to First Graders SP
The interest in employing automatic speech recognition (ASR) in applications
for reading practice has been growing in recent years. In a previous study, we
presented an ASR-based Dutch reading tutor application that was developed to
provide instantaneous feedback to first-graders learning to read. We saw that
ASR has potential at this stage of the reading process, as the results
suggested that pupils made progress in reading accuracy and fluency by using
the software. In the current study, we used children's speech from an existing
corpus (JASMIN) to develop two new ASR systems, and compared the results to
those of the previous study. We analyze correct/incorrect classification of the
ASR systems using human transcripts at word level, by means of evaluation
measures such as Cohen's Kappa, Matthews Correlation Coefficient (MCC),
precision, recall and F-measures. We observe improvements for the newly
developed ASR systems regarding the agreement with human-based judgment and
correct rejection (CR). The accuracy of the ASR systems varies for different
reading tasks and word types. Our results suggest that, in the current
configuration, it is difficult to classify isolated words. We discuss these
results, possible ways to improve our systems and avenues for future research.
comment: Published (double-blind peer-reviewed) at SPECOM 2021
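Two of the agreement measures used in this evaluation have closed forms over a binary confusion matrix and can be computed directly (a self-contained sketch for the two-class case):

```python
import math

def mcc(tp, fp, fn, tn):
    # Matthews Correlation Coefficient over a binary confusion matrix.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

def cohens_kappa(tp, fp, fn, tn):
    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = tp + fp + fn + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

Both measures equal 1 for perfect agreement and 0 when the classifier is no better than chance.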
☆ A New Dataset and Empirical Study for Sentence Simplification in Chinese ACL2023
Sentence Simplification is a valuable technique that can greatly benefit
language learners and children. However, current research focuses mostly on
English sentence simplification, and the development of Chinese sentence
simplification has been relatively slow due to a lack of data. To alleviate
this limitation, this
paper introduces CSS, a new dataset for assessing sentence simplification in
Chinese. We collect manual simplifications from human annotators and perform
data analysis to show the difference between English and Chinese sentence
simplifications. Furthermore, we test several unsupervised and zero/few-shot
learning methods on CSS and analyze the automatic evaluation and human
evaluation results. In the end, we explore whether Large Language Models can
serve as high-quality Chinese sentence simplification systems by evaluating
them on CSS.
comment: Accepted by ACL2023 main conference
☆ Knowing-how & Knowing-that: A New Task for Machine Reading Comprehension of User Manuals
The machine reading comprehension (MRC) of user manuals has huge potential in
customer service. However, current methods have trouble answering complex
questions. Therefore, we introduce the Knowing-how & Knowing-that task that
requires the model to answer factoid-style, procedure-style, and inconsistent
questions about user manuals. We resolve this task by jointly representing the
steps and facts in a graph (TARA), which supports a unified inference of
various questions. Toward a systematic benchmarking study, we design a
heuristic method to automatically parse user manuals into TARAs and build an
annotated dataset to test the model's ability in answering real-world
questions. Empirical results demonstrate that representing user manuals as
TARAs is a desired solution for the MRC of user manuals. An in-depth
investigation of TARA further sheds light on the issues and broader impacts of
future representations of user manuals. We hope our work can move the MRC of
user manuals to a more complex and realistic stage.
☆ Benchmarking Foundation Models with Language-Model-as-an-Examiner
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou
Numerous benchmarks have been established to assess the performance of
foundation models on open-ended question answering, which serves as a
comprehensive test of a model's ability to understand and generate language in
a manner similar to humans. Most of these works focus on proposing new
datasets; however, we see two main issues within previous benchmarking
pipelines, namely testing leakage and evaluation automation. In this paper, we
propose a novel benchmarking framework, Language-Model-as-an-Examiner, where
the LM serves as a knowledgeable examiner that formulates questions based on
its knowledge and evaluates responses in a reference-free manner. Our framework
allows for effortless extensibility as various LMs can be adopted as the
examiner, and the questions can be constantly updated given more diverse
trigger topics. For a more comprehensive and equitable evaluation, we devise
three strategies: (1) We instruct the LM examiner to generate questions across
a multitude of domains to probe for broad knowledge acquisition, and raise follow-up
questions to engage in a more in-depth assessment. (2) Upon evaluation, the
examiner combines both scoring and ranking measurements, providing a reliable
result as it aligns closely with human annotations. (3) We additionally propose
a decentralized Peer-examination method to address the biases in a single
examiner. Our data and benchmarking results are available at:
https://lmexam.com.
comment: 23 pages, 8 figures
☆ When to Read Documents or QA History: On Unified and Selective Open-domain QA ACL 2023
This paper studies the problem of open-domain question answering, with the
aim of answering a diverse range of questions leveraging knowledge resources.
Two types of sources, QA-pair and document corpora, have been actively
leveraged with the following complementary strength. The former is highly
precise when a paraphrase of the given question $q$ was seen and answered during
training, often posed as a retrieval problem, while the latter generalizes
better to unseen questions. A natural follow-up is thus to leverage both
models, yet naive pipelining or integration approaches have failed to bring
additional gains over either model alone. Our distinction is interpreting the
problem as calibration, which estimates the confidence of predicted answers as
an indicator to decide when to use a document or QA-pair corpus. The
effectiveness of our method was validated on widely adopted benchmarks such as
Natural Questions and TriviaQA.
comment: Findings of ACL 2023 camera ready
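The calibration-based selection can be sketched as a confidence-threshold router; the model interfaces and threshold value are illustrative assumptions, not the paper's exact calibrator.

```python
def answer(question, qa_model, doc_model, threshold=0.5):
    # Route between a QA-pair retriever and a document reader using the
    # QA model's calibrated confidence in its predicted answer.
    ans, conf = qa_model(question)
    if conf >= threshold:
        return ans  # likely a seen paraphrase: trust the QA-pair corpus
    return doc_model(question)  # unseen question: fall back to documents
```

High confidence suggests the question paraphrases one seen in training, so the precise QA-pair model is used; otherwise the document model, which generalizes better, answers.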
☆ From the One, Judge of the Whole: Typed Entailment Graph Construction with Predicate Generation ACL 2023
Entailment Graphs (EGs) have been constructed based on extracted corpora as a
strong and explainable form to indicate context-independent entailment
relations in natural languages. However, EGs built by previous methods often
suffer from severe sparsity issues, due to the limited corpora available and
the long-tail phenomenon of predicate distributions. In this paper, we propose
a multi-stage method, Typed Predicate-Entailment Graph Generator (TP-EGG), to
tackle this problem. Given several seed predicates, TP-EGG builds the graphs by
generating new predicates and detecting entailment relations among them. The
generative nature of TP-EGG helps us leverage the recent advances from large
pretrained language models (PLMs), while avoiding the reliance on carefully
prepared corpora. Experiments on benchmark datasets show that TP-EGG can
generate high-quality and scale-controllable entailment graphs, achieving
significant in-domain improvement over state-of-the-art EGs and boosting the
performance of downstream inference tasks.
comment: 9 pages, 3 figures, accepted to ACL 2023
☆ Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions ACL 2023
Large language models (LLMs) can be used to generate text data for training
and evaluating other models. However, creating high-quality datasets with LLMs
can be challenging. In this work, we explore human-AI partnerships to
facilitate high diversity and accuracy in LLM-based text data generation. We
first examine two approaches to diversify text generation: 1) logit
suppression, which discourages generating tokens that have already been
frequently generated, and 2) temperature sampling, which flattens the token
sampling distribution. We found that diversification approaches can increase
data diversity but often at the cost of data accuracy (i.e., text and labels
being appropriate for the target domain). To address this issue, we examined
two human interventions, 1) label replacement (LR), correcting misaligned
labels, and 2) out-of-scope filtering (OOSF), removing instances that are out
of the user's domain of interest or to which no considered label applies. With
oracle studies, we found that LR increases the absolute accuracy of models
trained with diversified datasets by 14.4%. Moreover, we found that some models
trained with data generated with LR interventions outperformed LLM-based
few-shot classification. In contrast, OOSF was not effective in increasing
model accuracy, implying the need for future work in human-in-the-loop text
data generation.
comment: Accepted as a long paper at ACL 2023
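The two diversification knobs can be sketched as follows (the constants and the linear penalty form are illustrative, not the paper's exact formulation):

```python
import math

def diversified_sampling_probs(logits, counts, temperature=1.5, penalty=0.5):
    """Subtract a penalty proportional to how often each label was already
    generated (logit suppression), then divide by a temperature > 1 to
    flatten the distribution (temperature sampling)."""
    adjusted = [l - penalty * c for l, c in zip(logits, counts)]
    scaled = [a / temperature for a in adjusted]
    m = max(scaled)                      # numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.5]
counts = [10, 0, 0]                      # the first label is over-generated
p = diversified_sampling_probs(logits, counts)
```

With both knobs active, the over-generated first label loses most of its probability mass relative to plain softmax sampling.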
☆ Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering
Large Language Models (LLMs) are capable of performing zero-shot closed-book
question answering tasks, based on their internal knowledge stored in
parameters during pre-training. However, such internalized knowledge might be
insufficient or incorrect, which could lead LLMs to generate factually wrong
answers. Furthermore, fine-tuning LLMs to update their knowledge is expensive.
To this end, we propose to augment the knowledge directly in the input of LLMs.
Specifically, we first retrieve the relevant facts to the input question from
the knowledge graph based on semantic similarities between the question and its
associated facts. After that, we prepend the retrieved facts to the input
question in the form of the prompt, which is then forwarded to LLMs to generate
the answer. Our framework, Knowledge-Augmented language model PromptING
(KAPING), requires no model training and is thus completely zero-shot. We
validate the performance of our KAPING framework on the knowledge graph
question answering task, which aims to answer the user's question based on
facts over a knowledge graph, on which ours outperforms relevant zero-shot
baselines by up to 48% on average, across multiple LLMs of various sizes.
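The retrieve-then-prepend pipeline can be sketched as below; toy word-count vectors stand in for the real semantic encoder, and the prompt template is illustrative:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; KAPING uses a semantic encoder instead.
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def kaping_prompt(question, facts, k=2):
    # Rank facts by similarity to the question, keep the top k, and prepend
    # them to the question; no model training is involved.
    q = embed(question)
    ranked = sorted(facts, key=lambda f: cosine(q, embed(f)), reverse=True)
    context = "\n".join(ranked[:k])
    return (f"Below are facts that may be relevant to the question:\n"
            f"{context}\nQuestion: {question}\nAnswer:")

facts = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "France is in Europe.",
]
prompt = kaping_prompt("What is the capital of France?", facts)
```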
☆ Multimodal Fusion Interactions: A Study of Human and Automatic Quantification
Multimodal fusion of multiple heterogeneous and interconnected signals is a
fundamental challenge in almost all multimodal problems and applications. In
order to perform multimodal fusion, we need to understand the types of
interactions that modalities can exhibit: how each modality individually
provides information useful for a task and how this information changes in the
presence of other modalities. In this paper, we perform a comparative study of
how human annotators can be leveraged to annotate two categorizations of
multimodal interactions: (1) partial labels, where different randomly assigned
annotators annotate the label given the first, the second, and both
modalities; and (2) counterfactual labels, where the same annotator annotates
the label given the first modality, is then shown the second modality, and is
asked to explicitly reason about how their answer changes. We then propose an
alternative taxonomy based on (3) information decomposition, where annotators
annotate the degrees of redundancy (the extent to which the modalities
individually and together yield the same prediction on the task), uniqueness
(the extent to which one modality enables a task prediction that the other
does not), and synergy (the extent to which only both modalities together
enable a prediction that neither would support individually). Through
extensive experiments and annotations, we
highlight several opportunities and limitations of each approach and propose a
method to automatically convert annotations of partial and counterfactual
labels to information decomposition, yielding an accurate and efficient method
for quantifying interactions in multimodal datasets.
☆ Unbalanced Optimal Transport for Unbalanced Word Alignment ACL 2023
Monolingual word alignment is crucial to model semantic interactions between
sentences. In particular, null alignment, a phenomenon in which words have no
corresponding counterparts, is pervasive and critical in handling semantically
divergent sentences. Identification of null alignment is useful on its own to
reason about the semantic similarity of sentences by indicating there exists
information inequality. To achieve unbalanced word alignment that values both
alignment and null alignment, this study shows that the family of optimal
transport (OT) methods, i.e., balanced, partial, and unbalanced OT, is a
natural and powerful approach even without tailor-made techniques. Our
extensive experiments covering unsupervised and supervised settings indicate
that our generic OT-based alignment methods are competitive with
state-of-the-art methods specially designed for word alignment, notably on
challenging datasets with high null alignment frequencies.
comment: Accepted for the Annual Meeting of the Association for Computational
Linguistics (ACL 2023)
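The OT formulation the abstract appeals to can be illustrated with a plain Sinkhorn solver; this is a hedged sketch of balanced OT only, with a toy cost matrix (the partial and unbalanced variants relax the marginal constraints, which is what lets words remain null-aligned):

```python
import math

def sinkhorn(C, reg=0.1, n_iter=200):
    """Balanced entropic OT between uniform marginals via Sinkhorn iterations.
    Returns the transport plan, read here as a soft word-alignment matrix."""
    n, m = len(C), len(C[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    K = [[math.exp(-c / reg) for c in row] for row in C]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost matrix between two 2-word sentences (low cost = similar words):
C = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(C)
```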
☆ Gotta: Generative Few-shot Question Answering by Prompt-based Cloze Data Augmentation
Few-shot question answering (QA) aims at precisely discovering answers to a
set of questions from context passages while only a few training samples are
available. Although existing studies have made some progress and usually
achieve reasonable results, they struggle to capture the deep semantics needed
to reason out the answers. In this paper, we develop Gotta, a Generative
prOmpT-based daTa Augmentation framework to mitigate the challenge above.
Inspired by the human reasoning process, we propose to integrate the cloze task
to enhance few-shot QA learning. Following the recent success of prompt-tuning,
we present the cloze task in the same format as the main QA task, allowing the
model to learn both tasks seamlessly together to fully take advantage of the
power of prompt-tuning. Extensive experiments on widely used benchmarks
demonstrate that Gotta consistently outperforms competitive baselines,
validating the effectiveness of our proposed prompt-tuning-based cloze task,
which not only fine-tunes language models but also learns to guide reasoning in
QA tasks. Further analysis shows that the prompt-based loss incorporates the
auxiliary task better than the multi-task loss, highlighting the strength of
prompt-tuning on the few-shot QA task.
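The cloze-style augmentation can be sketched as follows (the field names and prompt format here are illustrative, not Gotta's exact format): blank out the answer span in the passage so the model must recover it, mirroring the main QA task.

```python
def make_cloze(context, answer, mask="[MASK]"):
    """Turn a (passage, answer) pair into a cloze example."""
    assert answer in context, "answer span must occur in the passage"
    return {
        "cloze_question": context.replace(answer, mask, 1),
        "answer": answer,
    }

example = make_cloze("Marie Curie won the Nobel Prize in 1903.", "Marie Curie")
```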
☆ XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations ACL 2023
Cross-Lingual Semantic Parsing (CLSP) aims to translate queries in multiple
natural languages (NLs) into meaning representations (MRs) such as SQL, lambda
calculus, and logic forms. However, existing CLSP models are separately
proposed and evaluated on datasets of limited tasks and applications, impeding
a comprehensive and unified evaluation of CLSP on a diverse range of NLs and
MRs. To this end, we present XSemPLR, a unified benchmark for cross-lingual
semantic parsing featuring 22 natural languages and 8 meaning
representations by examining and selecting 9 existing datasets to cover 5 tasks
and 164 domains. We use XSemPLR to conduct a comprehensive benchmark study on a
wide range of multilingual language models including encoder-based models
(mBERT, XLM-R), encoder-decoder models (mBART, mT5), and decoder-based models
(Codex, BLOOM). We design 6 experiment settings covering various lingual
combinations (monolingual, multilingual, cross-lingual) and numbers of learning
samples (full dataset, few-shot, and zero-shot). Our experiments show that
encoder-decoder models (mT5) achieve the highest performance compared with
other popular models, and multilingual training can further improve the average
performance. Notably, multilingual large language models (e.g., BLOOM) are
still inadequate to perform CLSP tasks. We also find that the performance gap
between monolingual training and cross-lingual transfer learning is still
significant for multilingual models, though it can be mitigated by
cross-lingual few-shot training. Our dataset and code are available at
https://github.com/psunlpgroup/XSemPLR.
comment: ACL 2023
☆ Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer
Domain adaptation using a text-only corpus is challenging in end-to-end (E2E)
speech recognition. Adaptation by synthesizing audio from text through TTS is
resource-consuming. We present a method to learn a Unified Speech-Text
Representation in the Conformer Transducer (USTR-CT) to enable fast domain
adaptation using a text-only corpus. In contrast to the previous textogram
method, our approach introduces an extra text encoder to learn the text
representation; it is removed during inference, so no modification is needed
for online deployment. To improve the efficiency of adaptation, single-step and
multi-step adaptations are also explored. The experiments on adapting
LibriSpeech to SPGISpeech show that the proposed method reduces the word error
rate (WER) by a relative 44% on the target domain, outperforming both the TTS
and textogram methods. It is also shown that the proposed method can be
combined with internal language model estimation (ILME) to further improve the
performance.
comment: Submitted to Interspeech 2023
♻ ☆ PALR: Personalization Aware LLMs for Recommendation
Large language models (LLMs) have recently received significant attention for
their exceptional capabilities. Despite extensive efforts in developing
general-purpose LLMs that can be utilized in various natural language
processing (NLP) tasks, there has been less research exploring their potential
in recommender systems. In this paper, we propose a novel framework, named
PALR, which aims to combine user history behaviors (such as clicks,
purchases, ratings, etc.) with LLMs to generate user-preferred items.
Specifically, we first use user/item interactions as guidance for candidate
retrieval. Then we adopt an LLM-based ranking model to generate recommended
items. Unlike existing approaches that typically adopt general-purpose LLMs for
zero/few-shot recommendation testing or training on small-sized language models
(with less than 1 billion parameters), which cannot fully elicit LLMs'
reasoning abilities and leverage rich item side parametric knowledge, we
fine-tune a 7-billion-parameter LLM for the ranking purpose. This model takes
retrieval candidates in natural language format as input, with an instruction
that explicitly asks it to select results from the input candidates during
inference. Our experimental results demonstrate that our solution outperforms
state-of-the-art models on various sequential recommendation tasks.
♻ ☆ Language Models can Solve Computer Tasks
Agents capable of carrying out general tasks on a computer can improve
efficiency and productivity by automating repetitive tasks and assisting in
complex problem-solving. Ideally, such agents should be able to solve new
computer tasks presented to them through natural language commands. However,
previous approaches to this problem require large amounts of expert
demonstrations and task-specific reward functions, both of which are
impractical for new tasks. In this work, we show that a pre-trained large
language model (LLM) agent can execute computer tasks guided by natural
language using a simple prompting scheme where the agent Recursively Criticizes
and Improves its output (RCI). The RCI approach significantly outperforms
existing LLM methods for automating computer tasks and surpasses supervised
learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++
benchmark. We compare multiple LLMs and find that RCI with the
InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful
of demonstrations per task rather than tens of thousands, and without a
task-specific reward function. Furthermore, we demonstrate RCI prompting's
effectiveness in enhancing LLMs' reasoning abilities on a suite of natural
language reasoning tasks, outperforming chain of thought (CoT) prompting. We
find that RCI combined with CoT performs better than either separately. Our
code can be found here: https://github.com/posgnu/rci-agent.
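The criticize-and-improve loop can be sketched as below; `llm` is any prompt-to-text callable, and the stub here stands in for a real model so the control flow is visible (the prompt wording is illustrative, not the paper's exact prompts):

```python
def rci(llm, task, n_rounds=2):
    """Generate an output, then repeatedly ask the model to criticize it
    and produce an improved version."""
    output = llm(f"Task: {task}\nOutput:")
    for _ in range(n_rounds):
        critique = llm(f"Task: {task}\nOutput: {output}\n"
                       f"What is wrong with this output?")
        output = llm(f"Task: {task}\nOutput: {output}\n"
                     f"Critique: {critique}\nImproved output:")
    return output

def stub_llm(prompt):
    # Canned responses keyed on the prompt suffix, purely for illustration.
    if prompt.endswith("What is wrong with this output?"):
        return "too vague"
    if prompt.endswith("Improved output:"):
        return "revised"
    return "draft"

result = rci(stub_llm, "click the submit button")
```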
♻ ☆ ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory
Large language models (LLMs) with memory are computationally universal.
However, mainstream LLMs are not taking full advantage of memory, and the
designs are heavily influenced by biological brains. Due to their approximate
nature and proneness to the accumulation of errors, conventional neural memory
mechanisms cannot support LLMs to simulate complex reasoning. In this paper, we
seek inspiration from modern computer architectures to augment LLMs with
symbolic memory for complex multi-hop reasoning. Such a symbolic memory
framework is instantiated as an LLM and a set of SQL databases, where the LLM
generates SQL instructions to manipulate the SQL databases. We validate the
effectiveness of the proposed memory framework on a synthetic dataset requiring
complex reasoning. The project website is available at
https://chatdatabase.github.io/ .
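The database-as-symbolic-memory idea can be sketched with `sqlite3`: an LLM would emit SQL strings that are executed against a real database, so stored facts stay exact instead of being approximated in neural memory. The "LLM output" below is a canned list purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")

# Statements an LLM might emit while processing a dialogue:
llm_emitted_sql = [
    "INSERT INTO orders VALUES ('apple', 3)",
    "INSERT INTO orders VALUES ('apple', 2)",
]
for stmt in llm_emitted_sql:
    conn.execute(stmt)

# A multi-hop question ("how many apples in total?") answered symbolically,
# with no risk of accumulated neural-memory error:
total = conn.execute(
    "SELECT SUM(qty) FROM orders WHERE item='apple'"
).fetchone()[0]
```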
♻ ☆ Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization ACL 2023
Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, Wayne Xiong, Michael Zeng, Jianfeng Gao, Xuedong Huang
This paper presents Z-Code++, a new pre-trained language model optimized for
abstractive text summarization. The model extends the state of the art
encoder-decoder model using three techniques. First, we use a two-phase
pre-training process to improve the model's performance on low-resource
summarization tasks. The model is first pre-trained using text corpora for
language understanding, and then is continually pre-trained on summarization
corpora for grounded text generation. Second, we replace self-attention layers
in the encoder with disentangled attention layers, where each word is
represented using two vectors that encode its content and position,
respectively. Third, we use fusion-in-encoder, a simple yet effective method of
encoding long sequences in a hierarchical manner. Z-Code++ creates new state of
the art on 9 out of 13 text summarization tasks across 5 languages. Our model
is parameter-efficient in that it outperforms the 600x larger PaLM-540B on
XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and
few-shot settings, our model substantially outperforms the competing models.
comment: 16 pages, 3 figures. Accepted as long paper in main conference of ACL
2023
♻ ☆ Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale
Federico Bianchi, Pratyusha Kalluri, Esin Durmus, Faisal Ladhak, Myra Cheng, Debora Nozza, Tatsunori Hashimoto, Dan Jurafsky, James Zou, Aylin Caliskan
Machine learning models that convert user-written text descriptions into
images are now widely available online and used by millions of users to
generate millions of images a day. We investigate the potential for these
models to amplify dangerous and complex stereotypes. We find a broad range of
ordinary prompts produce stereotypes, including prompts simply mentioning
traits, descriptors, occupations, or objects. For example, we find cases of
prompting for basic traits or social roles resulting in images reinforcing
whiteness as ideal, prompting for occupations resulting in amplification of
racial and gender disparities, and prompting for objects resulting in
reification of American norms. Stereotypes are present regardless of whether
prompts explicitly mention identity and demographic language or avoid such
language. Moreover, stereotypes persist despite mitigation strategies; neither
user attempts to counter stereotypes by requesting images with specific
counter-stereotypes nor institutional attempts to add system ``guardrails''
have prevented the perpetuation of stereotypes. Our analysis justifies concerns
regarding the impacts of today's models, presenting striking exemplars, and
connecting these findings with deep insights into harms drawn from social
scientific and humanist disciplines. This work contributes to the effort to
shed light on the uniquely complex biases in language-vision models and
demonstrates the ways that the mass deployment of text-to-image generation
models results in mass dissemination of stereotypes and resulting harms.
comment: FAccT 2023 paper. The published version is available at
10.1145/3593013.3594095
♻ ☆ SpokenWOZ: A Large-Scale Speech-Text Dataset for Spoken Task-Oriented Dialogue in Multiple Domains
Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li
Task-oriented dialogue (TOD) models have made significant progress in recent
years. However, previous studies primarily focus on datasets written by
annotators, which has resulted in a gap between academic research and
real-world spoken conversation scenarios. While several small-scale spoken TOD
datasets are proposed to address robustness issues such as ASR errors, they
ignore the unique challenges in spoken conversation. To tackle the limitations,
we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD,
containing 8 domains, 203k turns, 5.7k dialogues and 249 hours of audio from
human-to-human spoken conversations. SpokenWOZ further incorporates common
spoken characteristics such as word-by-word processing and reasoning in spoken
language. Based on these characteristics, we present cross-turn slot and
reasoning slot detection as new challenges. We conduct experiments on various
baselines, including text-modal models, newly proposed dual-modal models, and
LLMs, e.g., ChatGPT. The results show that the current models still have
substantial room for improvement in spoken conversation, where the most
advanced dialogue state tracker only achieves 25.65% in joint goal accuracy and
the SOTA end-to-end model only correctly completes the user request in 52.1% of
dialogues. The dataset, code, and leaderboard are available:
https://spokenwoz.github.io/SpokenWOZ-github.io/.
♻ ☆ Extrapolative Controlled Sequence Generation via Iterative Refinement ICML 2023
We study the problem of extrapolative controlled generation, i.e., generating
sequences with attribute values beyond the range seen in training. This task is
of significant importance in automated design, especially drug discovery, where
the goal is to design novel proteins that are \textit{better} (e.g., more
stable) than existing sequences. Thus, by definition, the target sequences and
their attribute values are out of the training distribution, posing challenges
to existing methods that aim to directly generate the target sequence. Instead,
in this work, we propose Iterative Controlled Extrapolation (ICE) which
iteratively makes local edits to a sequence to enable extrapolation. We train
the model on synthetically generated sequence pairs that demonstrate small
improvement in the attribute value. Results on one natural language task
(sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV
fitness) show that ICE considerably outperforms state-of-the-art approaches
despite its simplicity. Our code and models are available at:
https://github.com/vishakhpk/iter-extrapolation.
comment: ICML 2023 - Camera Ready Version
♻ ☆ Handling the Alignment for Wake Word Detection: A Comparison Between Alignment-Based, Alignment-Free and Hybrid Approaches
Wake word detection exists in most intelligent homes and portable devices. It
offers these devices the ability to "wake up" when summoned at a low cost of
power and computing. This paper focuses on understanding alignment's role in
developing a wake-word system that answers a generic phrase. We discuss three
approaches. The first is alignment-based, where the model is trained with
frame-wise cross-entropy. The second is alignment-free, where the model is
trained with CTC. The third, proposed by us, is a hybrid solution in which the
model is trained with a small set of aligned data and then tuned with a
sizeable unaligned dataset. We compare the three approaches and evaluate the
impact of the different aligned-to-unaligned ratios for hybrid training. Our
results show that the alignment-free system performs better than the
alignment-based for the target operating point, and with a small fraction of
the data (20%), we can train a model that complies with our initial
constraints.
comment: Accepted to Interspeech 2023
♻ ☆ Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models
Current dialogue research primarily studies pairwise (two-party)
conversations, and does not address the everyday setting where more than two
speakers converse together. In this work, we both collect and evaluate
multi-party conversations to study this more general case. We use the LIGHT
environment to construct grounded conversations, where each participant has an
assigned character to role-play. We thus evaluate the ability of language
models to act as one or more characters in such conversations. Models require
two skills that pairwise-trained models appear to lack: (1) being able to
decide when to talk; (2) producing coherent utterances grounded on multiple
characters. We compare models trained on our new dataset to existing
pairwise-trained dialogue models, as well as large language models with
few-shot prompting. We find that our new dataset, MultiLIGHT, which we will
publicly release, can help bring significant improvements in the group setting.
♻ ☆ ChatGPT an ENFJ, Bard an ISTJ: Empirical Study on Personalities of Large Language Models
Large Language Models (LLMs) have made remarkable advancements in the field
of artificial intelligence, significantly reshaping human-computer
interaction. We not only focus on the performance of LLMs, but also explore
their features from a psychological perspective, acknowledging the importance
of understanding their behavioral characteristics. Our study examines the
behavioral patterns displayed by LLMs by employing trait theory, a
psychological framework. We first focus on evaluating the consistency of
personality types exhibited by ChatGPT. Furthermore, experiments include
cross-lingual effects on seven additional languages, and the investigation of
six other LLMs. Moreover, the study investigates whether ChatGPT can exhibit
personality changes in response to instructions or contextual cues. The
findings show that ChatGPT consistently maintains its ENFJ personality
regardless of instructions or contexts. By shedding light on the
personalization of LLMs, we anticipate that our study will serve as a catalyst
for further research in this field.
comment: Added robustness analysis against fine-tuning (results of
text-davinci-003); Added results of ChatGLM; Added limitations
♻ ☆ Uncovering and Categorizing Social Biases in Text-to-SQL
Content Warning: This work contains examples that potentially implicate
stereotypes, associations, and other harms that could be offensive to
individuals in certain social groups. Large pre-trained language models are
acknowledged to carry social biases towards different demographics, which can
further amplify existing stereotypes in our society and cause even more harm.
Text-to-SQL is an important task, models of which are mainly adopted by
administrative industries, where unfair decisions may lead to catastrophic
consequences. However, existing Text-to-SQL models are trained on clean,
neutral datasets, such as Spider and WikiSQL. This, to some extent, covers up
social bias in models under ideal conditions, which nevertheless may emerge in
real application scenarios. In this work, we aim to uncover and categorize
social biases in Text-to-SQL models. We summarize the categories of social
biases that may occur in structured data for Text-to-SQL models. We build test
benchmarks and reveal that models with similar task accuracy can contain social
biases at very different rates. We show how to take advantage of our
methodology to uncover and assess social biases in the downstream Text-to-SQL
task. We will release our code and data.
♻ ☆ Causal interventions expose implicit situation models for commonsense language understanding ACL
Accounts of human language processing have long appealed to implicit
``situation models'' that enrich comprehension with relevant but unstated world
knowledge. Here, we apply causal intervention techniques to recent transformer
models to analyze performance on the Winograd Schema Challenge (WSC), where a
single context cue shifts interpretation of an ambiguous pronoun. We identify a
relatively small circuit of attention heads that are responsible for
propagating information from the context word that guides which of the
candidate noun phrases the pronoun ultimately attends to. We then compare how
this circuit behaves in a closely matched ``syntactic'' control where the
situation model is not strictly necessary. These analyses suggest distinct
pathways through which implicit situation models are constructed to guide
pronoun resolution.
comment: Findings of ACL
♻ ☆ Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity ACL
Previous work has shown that the representations output by contextual
language models are more anisotropic than static type embeddings, and typically
display outlier dimensions. This seems to be true for both monolingual and
multilingual models, although much less work has been done on the multilingual
context. Why these outliers occur and how they affect the representations is
still an active area of research. We investigate outlier dimensions and their
relationship to anisotropy in multiple pre-trained multilingual language
models. We focus on cross-lingual semantic similarity tasks, as these are
natural tasks for evaluating multilingual representations. Specifically, we
examine sentence representations. Sentence transformers which are fine-tuned on
parallel resources (that are not always available) perform better on this task,
and we show that their representations are more isotropic. However, we aim to
improve multilingual representations in general. We investigate how much of the
performance difference can be made up by only transforming the embedding space
without fine-tuning, and visualise the resulting spaces. We test different
operations: Removing individual outlier dimensions, cluster-based isotropy
enhancement, and ZCA whitening. We publish our code for reproducibility.
comment: To appear in ACL Findings 2023. Fixed a citation in this version
♻ ☆ A Context-Sensitive Word Embedding Approach for The Detection of Troll Tweets
In this study, we aimed to address the growing concern of trolling behavior
on social media by developing and evaluating a set of model architectures for
the automatic detection of troll tweets. Utilizing deep learning techniques and
pre-trained word embedding methods such as BERT, ELMo, and GloVe, we evaluated
the performance of each architecture using metrics such as classification
accuracy, F1 score, AUC, and precision. Our results indicate that BERT and ELMo
embedding methods performed better than the GloVe method, likely due to their
ability to provide contextualized word embeddings that better capture the
nuances and subtleties of language use in online social media. Additionally, we
found that CNN and GRU encoders performed similarly in terms of F1 score and
AUC, suggesting their effectiveness in extracting relevant information from
input text. The best-performing method was found to be an ELMo-based
architecture that employed a GRU classifier, with an AUC score of 0.929. This
research highlights the importance of utilizing contextualized word embeddings
and appropriate encoder methods in the task of troll tweet detection, which can
assist social-based systems in improving their performance in identifying and
addressing trolling behavior on their platforms.
♻ ☆ Improving Cancer Hallmark Classification with BERT-based Deep Learning Approach
This paper presents a novel approach to accurately classify the hallmarks of
cancer, which is a crucial task in cancer research. Our proposed method
utilizes the Bidirectional Encoder Representations from Transformers (BERT)
architecture, which has shown exceptional performance in various downstream
applications. By applying transfer learning, we fine-tuned the pre-trained BERT
model on a small corpus of biomedical text documents related to cancer. The
outcomes of our experimental investigations demonstrate that our approach
attains a noteworthy accuracy of 94.45%, surpassing almost all prior findings
with a substantial increase of at least 8.04% as reported in the literature.
These findings highlight the effectiveness of our proposed model in accurately
classifying and comprehending text documents for cancer research, thus
contributing significantly to the field. As cancer remains one of the top ten
leading causes of death globally, our approach holds great promise in advancing
cancer research and improving patient outcomes.
♻ ☆ HowkGPT: Investigating the Detection of ChatGPT-generated University Student Homework through Context-Aware Perplexity Analysis
As the use of Large Language Models (LLMs) in text generation tasks
proliferates, concerns arise over their potential to compromise academic
integrity. The education sector currently tussles with distinguishing
student-authored homework assignments from AI-generated ones. This paper
addresses the challenge by introducing HowkGPT, designed to identify homework
assignments generated by AI. HowkGPT is built upon a dataset of academic
assignments and accompanying metadata [17] and employs a pretrained LLM to
compute perplexity scores for student-authored and ChatGPT-generated responses.
These scores then assist in establishing a threshold for discerning the origin
of a submitted assignment. Given the specificity and contextual nature of
academic work, HowkGPT further refines its analysis by defining
category-specific thresholds derived from the metadata, enhancing the precision
of the detection. This study emphasizes the critical need for effective
strategies to uphold academic integrity amidst the growing influence of LLMs
and provides an approach to ensuring fair and accurate grading in educational
institutions.
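The perplexity-thresholding idea can be sketched as follows; the toy log-probabilities and the single threshold are illustrative stand-ins for scores from a real pretrained LLM and for the metadata-derived, category-specific thresholds:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: the exponential of the
    mean negative log-likelihood."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def classify(token_logprobs, threshold):
    """Low perplexity (text the LM finds very predictable) suggests
    machine-generated writing."""
    return "ai-generated" if perplexity(token_logprobs) < threshold else "human"

# Toy log-probs: the "AI" text is predicted far more confidently by the LM.
ai_like = [-0.1, -0.2, -0.15]
human_like = [-2.3, -1.7, -2.9]
```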
♻ ☆ Using Bottleneck Adapters to Identify Cancer in Clinical Notes under Low-Resource Constraints
Omid Rohanian, Hannah Jauncey, Mohammadmahdi Nouriborji, Vinod Kumar Chauhan, Bronner P. Gonçalves, Christiana Kartsonaki, ISARIC Clinical Characterisation Group, Laura Merson, David Clifton
Processing information locked within clinical health records is a challenging
task that remains an active area of research in biomedical NLP. In this work,
we evaluate a broad set of machine learning techniques ranging from simple RNNs
to specialised transformers such as BioBERT on a dataset containing clinical
notes along with a set of annotations indicating whether a sample is
cancer-related or not.
Furthermore, we specifically employ efficient fine-tuning methods from NLP,
namely, bottleneck adapters and prompt tuning, to adapt the models to our
specialised task. Our evaluations suggest that fine-tuning a frozen BERT model
pre-trained on natural language and with bottleneck adapters outperforms all
other strategies, including full fine-tuning of the specialised BioBERT model.
Based on our findings, we suggest that using bottleneck adapters in
low-resource situations with limited access to labelled data or processing
capacity could be a viable strategy in biomedical text mining. The code used
in the experiments will be made available at
https://github.com/omidrohanian/bottleneck-adapters.
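A bottleneck adapter is a small trainable module inserted into a frozen backbone: a down-projection, a nonlinearity, and an up-projection with a residual connection. The sketch below uses NumPy for illustration; dimensions and initialization are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def bottleneck_adapter(h, W_down, W_up):
    """Bottleneck adapter: project hidden states down to a small dimension,
    apply a nonlinearity, project back up, and add a residual connection.
    Only W_down and W_up are trained; the backbone stays frozen."""
    z = np.maximum(0.0, h @ W_down)   # down-projection + ReLU
    return h + z @ W_up               # up-projection + residual

rng = np.random.default_rng(0)
d_model, d_bottleneck = 768, 64
h = rng.standard_normal((4, d_model))                 # 4 token representations
W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
W_up = np.zeros((d_bottleneck, d_model))              # zero-init: adapter starts as identity
out = bottleneck_adapter(h, W_down, W_up)
print(np.allclose(out, h))  # True: the adapter is a no-op at initialization
```

Zero-initializing the up-projection is a common choice so training starts from the frozen model's behavior and the adapter learns only the task-specific residual.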
♻ ☆ Saliency Map Verbalization: Comparing Feature Importance Representations from Model-free and Instruction-based Methods ACL 2023
Nils Feldhus, Leonhard Hennig, Maximilian Dustin Nasert, Christopher Ebert, Robert Schwarzenberg, Sebastian Möller
Saliency maps can explain a neural model's predictions by identifying
important input features. However, they are difficult for laypeople to interpret,
especially for instances with many features. In order to make them more
accessible, we formalize the underexplored task of translating saliency maps
into natural language and compare methods that address two key challenges of
this approach -- what and how to verbalize. In both automatic and human
evaluation setups, using token-level attributions from text classification
tasks, we compare two novel methods (search-based and instruction-based
verbalizations) against conventional feature importance representations
(heatmap visualizations and extractive rationales), measuring simulatability,
faithfulness, helpfulness and ease of understanding. Instructing GPT-3.5 to
generate saliency map verbalizations yields plausible explanations which
include associations, abstractive summarization and commonsense reasoning,
achieving by far the highest human ratings, but they do not faithfully
capture numeric information and are inconsistent in their interpretation of
the task. In comparison, our search-based, model-free verbalization approach
efficiently completes templated verbalizations, is faithful by design, but
falls short in helpfulness and simulatability. Our results suggest that
saliency map verbalization makes feature attribution explanations more
comprehensible and less cognitively challenging to humans than conventional
representations.
comment: ACL 2023 Workshop on Natural Language Reasoning and Structured
Explanations (NLRSE)
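The search-based, model-free approach described above can be sketched as filling a fixed template with the highest-attribution tokens, which makes the verbalization faithful by construction. The template and ranking rule here are illustrative assumptions, not the paper's exact method.

```python
def verbalize_saliency(tokens, scores, k=2):
    """Model-free, template-based verbalization: pick the k most important
    tokens by attribution score and slot them into a fixed template, so the
    output is faithful to the underlying saliency map by construction."""
    ranked = sorted(zip(tokens, scores), key=lambda p: -p[1])
    top = [t for t, _ in ranked[:k]]
    return f"The prediction is mainly driven by {' and '.join(repr(t) for t in top)}."

tokens = ["the", "movie", "was", "wonderful"]
scores = [0.02, 0.15, 0.03, 0.80]   # token-level attributions from a classifier
print(verbalize_saliency(tokens, scores))
# The prediction is mainly driven by 'wonderful' and 'movie'.
```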
♻ ☆ Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese
Much recent effort has been devoted to creating large-scale language models.
Nowadays, the most prominent approaches are based on deep neural networks, such
as BERT. However, they lack transparency and interpretability, and are often
seen as black boxes. This affects not only their applicability in downstream
tasks but also the comparability of different architectures or even of the same
model trained using different corpora or hyperparameters. In this paper, we
propose a set of intrinsic evaluation tasks that inspect the linguistic
information encoded in models developed for Brazilian Portuguese. These tasks
are designed to evaluate how different language models generalise information
related to grammatical structures and multiword expressions (MWEs), thus
allowing for an assessment of whether the model has learned different
linguistic phenomena. The dataset that was developed for these tasks is
composed of a series of sentences with a single masked word and a cue phrase
that helps in narrowing down the context. This dataset is divided into MWEs and
grammatical structures, and the latter is subdivided into 6 tasks: impersonal
verbs, subject agreement, verb agreement, nominal agreement, passive and
connectors. The subset for MWEs was used to test BERTimbau Large, BERTimbau
Base and mBERT. For the grammatical structures, we used only BERTimbau Large,
because it yielded the best results in the MWE task.
comment: This is the original manuscript that was submitted to LREV. The final
version was published recently and can be found at: https://rdcu.be/ddEa6.
Language Resources and Evaluation, https://doi.org/10.1007/s10579-023-09664-1
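The probing setup, sentences with a single masked word plus a cue phrase, amounts to a cloze evaluation harness. The sketch below shows that harness with a lookup table standing in for the masked language model; the example items are invented for illustration, not drawn from the dataset.

```python
def probe_accuracy(examples, predict):
    """Cloze-style probe: each example is a sentence with a single [MASK]
    and an expected filler. `predict` stands in for a masked LM (e.g.
    BERTimbau) returning its top candidate for the masked slot."""
    correct = sum(predict(ex["sentence"]) == ex["expected"] for ex in examples)
    return correct / len(examples)

# Toy subject-verb agreement items (hypothetical examples).
examples = [
    {"sentence": "Ontem, os meninos [MASK] ao parque.", "expected": "foram"},
    {"sentence": "Ontem, o menino [MASK] ao parque.",  "expected": "foi"},
]

# Stand-in predictor: a real run would query a masked language model.
lookup = {examples[0]["sentence"]: "foram", examples[1]["sentence"]: "foi"}
print(probe_accuracy(examples, lookup.get))  # 1.0
```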
♻ ☆ Tractable Control for Autoregressive Language Generation
Despite the success of autoregressive large language models in text
generation, it remains a major challenge to generate text that satisfies
complex constraints: sampling from the conditional distribution
${\Pr}(\text{text} | \alpha)$ is intractable for even the simplest lexical
constraints $\alpha$. To overcome this challenge, we propose to use tractable
probabilistic models (TPMs) to impose lexical constraints in autoregressive
text generation models, which we refer to as GeLaTo (Generating Language with
Tractable Constraints). To demonstrate the effectiveness of this framework, we
use distilled hidden Markov models, where we can efficiently compute
${\Pr}(\text{text} | \alpha)$, to guide autoregressive generation from GPT2.
GeLaTo achieves state-of-the-art performance on challenging benchmarks for
constrained text generation (e.g., CommonGen), beating various strong baselines
by a large margin. Our work not only opens up new avenues for controlling large
language models but also motivates the development of more expressive TPMs.
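At each decoding step, control of the kind GeLaTo describes can be sketched as reweighting the LM's next-token distribution by the tractable model's estimate that the constraint $\alpha$ remains satisfiable, then renormalizing. The numbers below are toy stand-ins for GPT2 probabilities and HMM estimates.

```python
def constrained_next_token(p_lm, p_sat):
    """One step of TPM-guided decoding: combine the LM's next-token
    distribution with the tractable model's probability that the lexical
    constraint alpha can still be satisfied after emitting each token,
        p(x | prefix, alpha)  proportional to  p_lm(x) * p_sat(alpha | x),
    and renormalize."""
    weights = {tok: p_lm[tok] * p_sat.get(tok, 0.0) for tok in p_lm}
    z = sum(weights.values())
    return {tok: w / z for tok, w in weights.items()}

# Toy step: the constraint requires the keyword "dog" to appear eventually.
p_lm  = {"dog": 0.2, "cat": 0.5, "the": 0.3}   # stand-in LM probabilities
p_sat = {"dog": 1.0, "cat": 0.1, "the": 0.4}   # hypothetical HMM estimates
post = constrained_next_token(p_lm, p_sat)
print(max(post, key=post.get))  # dog
```

The point of using a TPM such as a distilled HMM is precisely that `p_sat` can be computed exactly and efficiently, which an autoregressive LM cannot do on its own.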
♻ ☆ SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL
Ruoxi Sun, Sercan O. Arik, Hootan Nakhost, Hanjun Dai, Rajarishi Sinha, Pengcheng Yin, Tomas Pfister
One impressive emergent capability of large language models (LLMs) is
generation of code, including Structured Query Language (SQL) for databases.
For the task of converting natural language text to SQL queries, Text-to-SQL,
adaptation of LLMs is of paramount importance, both in in-context learning and
fine-tuning settings, depending on the amount of adaptation data used. In this
paper, we propose an LLM-based Text-to-SQL model SQL-PaLM, leveraging on
PaLM-2, that pushes the state-of-the-art in both settings. Few-shot SQL-PaLM is
based on an execution-based self-consistency prompting approach designed for
Text-to-SQL, and achieves 77.3% test-suite accuracy on Spider, which, to the
best of our knowledge, is the first few-shot approach to outperform the
previous fine-tuned state-of-the-art by a significant margin (4%).
Furthermore, we demonstrate that fine-tuned SQL-PaLM improves on this by
another 1%. Towards applying SQL-PaLM to real-world scenarios, we further
evaluate its robustness on other
challenging variants of Spider and demonstrate the superior generalization
capability of SQL-PaLM. In addition, via extensive case studies, we demonstrate
the impressive intelligent capabilities and various success enablers of
LLM-based Text-to-SQL.
comment: 16 pages
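Execution-based self-consistency, as described for few-shot SQL-PaLM, can be sketched as: sample several SQL candidates, execute each, and return one whose result set is the most common, discarding candidates that fail to run. The candidates and schema below are invented for illustration.

```python
import sqlite3
from collections import Counter

def self_consistent_sql(candidates, db):
    """Execution-based self-consistency: run each sampled SQL candidate and
    return one whose execution result is the most frequent across samples,
    skipping candidates that raise errors."""
    results = []
    for sql in candidates:
        try:
            rows = tuple(db.execute(sql).fetchall())
            results.append((rows, sql))
        except sqlite3.Error:
            continue
    majority, _ = Counter(rows for rows, _ in results).most_common(1)[0]
    return next(sql for rows, sql in results if rows == majority)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (name TEXT, age INT)")
db.executemany("INSERT INTO t VALUES (?, ?)", [("ana", 30), ("bo", 25)])
candidates = [                              # hypothetical LLM samples
    "SELECT name FROM t WHERE age > 28",
    "SELECT name FROM t WHERE age >= 30",   # same result, different surface form
    "SELECT name FROM t WHERE age > 100",   # runs, but disagrees
    "SELECT nmae FROM t",                   # typo: fails to execute
]
print(self_consistent_sql(candidates, db))
```

Voting on execution results rather than on SQL strings lets syntactically different but semantically equivalent queries reinforce each other.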
♻ ☆ Test-Time Training on Nearest Neighbors for Large Language Models
Many recent efforts aim to augment language models with relevant information
retrieved from a database at test time. We avoid the need for prompt
engineering by directly fine-tuning the model on data retrieved at test time
using its standard training setup. For this purpose, we build a large-scale
distributed nearest neighbor index based on text embeddings of the Pile
dataset. Given a query to a language model, our system retrieves the neighbors
of the query and fine-tunes the model on the text data corresponding to those
neighbors. Surprisingly, retrieving and training on as few as 20 neighbors,
each for only one gradient iteration, drastically improves performance across
more than twenty language modeling tasks in the Pile benchmark. For example,
test-time training significantly narrows the performance gap between a small
GPT2 model and a GPTNeo model, more than ten times larger, that was
specifically trained to convergence on the Pile. Sufficient index quality and
size, however, are important. Our work establishes a valuable first baseline
for implementing test-time training in the context of large language models,
opening the door to numerous promising research avenues.
comment: Corrected Figure 8. Code repository here:
https://github.com/socialfoundations/tttlm
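The retrieval half of this pipeline can be sketched as cosine nearest-neighbor search over text embeddings; each retrieved text would then receive one gradient step of standard fine-tuning before the model answers the query. The embeddings below are random stand-ins for real Pile text embeddings.

```python
import numpy as np

def nearest_neighbors(query_vec, index_vecs, k=20):
    """Retrieve the k nearest index entries by cosine similarity. In
    test-time training, the model is then fine-tuned for one gradient
    iteration on the text behind each retrieved neighbor."""
    index_n = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = index_n @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(1)
index = rng.standard_normal((1000, 64))              # stand-in text embeddings
query = index[42] + 0.01 * rng.standard_normal(64)   # query close to entry 42
ids = nearest_neighbors(query, index, k=20)
print(ids[0])  # 42
```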
♻ ☆ Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural
Language Processing (NLP) during the past decade. However, the demands of long
document analysis are quite different from those of shorter texts, while the
ever-increasing size of documents uploaded online renders automated
understanding of long texts a critical area of research. This article has two
goals: a) it overviews the relevant neural building blocks, thus serving as a
short tutorial, and b) it surveys the state-of-the-art in long document NLP,
mainly focusing on two central tasks: document classification and document
summarization. Sentiment analysis for long texts is also covered, since it is
typically treated as a particular case of document classification. Thus, this
article concerns document-level analysis. It discusses the main challenges and
issues of long document NLP, along with the current solutions. Finally, the
relevant, publicly available, annotated datasets are presented, in order to
facilitate further research.
comment: 51 pages, 2 figures, 168 citations
♻ ☆ CADGE: Context-Aware Dialogue Generation Enhanced with Graph-Structured Knowledge Aggregation
Commonsense knowledge is crucial to many natural language processing tasks.
Existing works usually incorporate graph knowledge with conventional graph
neural networks (GNNs), leading to the text and graph knowledge encoding
processes being separated in a serial pipeline. We argue that these separate
representation learning stages may be suboptimal for neural networks to learn
the overall context contained in both types of input knowledge. In this paper,
we propose a novel context-aware graph-attention model (Context-aware GAT),
which can effectively incorporate global features of relevant knowledge graphs
based on a context-enhanced knowledge aggregation process. Specifically, our
framework leverages a novel representation learning approach to process
heterogeneous features, combining flattened graph knowledge with text. To the
best of our knowledge, this is the first attempt at hierarchically applying
graph knowledge aggregation on a connected subgraph in addition to contextual
information to support commonsense dialogue generation. This framework shows
superior performance compared to conventional GNN-based language frameworks.
Both automatic and human evaluation demonstrate that our proposed model
achieves significant performance gains over state-of-the-art baselines.
comment: Submitted to KBS
♻ ☆ TwistList: Resources and Baselines for Tongue Twister Generation
Previous work in phonetically-grounded language generation has mainly focused
on domains such as lyrics and poetry. In this paper, we present work on the
generation of tongue twisters - a form of language that is required to be
phonetically conditioned to maximise sound overlap, whilst maintaining semantic
consistency with an input topic, and still being grammatically correct. We
present \textbf{TwistList}, a large annotated dataset of tongue twisters,
consisting of 2.1K+ human-authored examples. We additionally present several
benchmark systems (referred to as TwisterMisters) for the proposed task of
tongue twister generation, including models that both do and do not require
training on in-domain data. We present the results of automatic and human
evaluation to demonstrate the performance of existing mainstream pre-trained
models in this task with limited (or no) task specific training and data, and
no explicit phonetic knowledge. We find that the task of tongue twister
generation is challenging for models under these conditions, yet some models
are still capable of generating acceptable examples of this language type.
♻ ☆ Early Discovery of Emerging Entities in Persian Twitter with Semantic Similarity
Discovering emerging entities (EEs) is the problem of finding entities before
their establishment. These entities can be critical for individuals, companies,
and governments. Many of these entities can be discovered on social media
platforms, e.g. Twitter. These entities have been a focus of research in
academia and industry in recent years. As in any machine learning problem,
data availability is one of the major challenges. This paper proposes EEPT,
an online clustering method that can discover EEs without any need for
training on a dataset. Additionally, due to the lack of a proper
evaluation metric, this paper uses a new metric to evaluate the results. The
results show that EEPT is promising and finds significant entities before their
establishment.
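A training-free online clustering step of the kind EEPT describes can be sketched as: each incoming embedding joins its most similar existing cluster, or starts a new one when no centroid is similar enough. The threshold and averaging rule below are illustrative assumptions, not the paper's algorithm.

```python
import math

def online_cluster(stream, centroids, threshold=0.8):
    """Online clustering sketch: assign each incoming vector to the nearest
    centroid by cosine similarity, or spawn a new cluster when nothing is
    similar enough. No training pass over a dataset is needed."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    for v in stream:
        best = max(range(len(centroids)),
                   key=lambda i: cos(v, centroids[i]), default=None)
        if best is not None and cos(v, centroids[best]) >= threshold:
            # Merge into the matching cluster by averaging the centroid.
            centroids[best] = [(c + x) / 2 for c, x in zip(centroids[best], v)]
        else:
            centroids.append(list(v))
    return centroids

clusters = online_cluster([(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)], [])
print(len(clusters))  # 2: the first two vectors merge, the third is new
```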
♻ ☆ Global Contrastive Batch Sampling via Optimization on Sample Permutations ICML 2023
Contrastive Learning has recently achieved state-of-the-art performance in a
wide range of tasks. Many contrastive learning approaches use mined hard
negatives to make batches more informative during training, but these
approaches are inefficient as they increase epoch length in proportion to the
number of
mined negatives and require frequent updates of nearest neighbor indices or
mining from recent batches. In this work, we provide an alternative to hard
negative mining, Global Contrastive Batch Sampling (GCBS), an efficient
approximation to the batch assignment problem that upper bounds the gap between
the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$,
in contrastive learning settings. Through experimentation we find GCBS improves
state-of-the-art performance in sentence embedding and code-search tasks.
Additionally, GCBS is easy to implement as it requires only a few additional
lines of code, does not maintain external data structures such as nearest
neighbor indices, is more computationally efficient than the most minimal hard
negative mining approaches, and makes no changes to the model being trained.
comment: ICML 2023; 21 pages, 7 figures
♻ ☆ bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark ACL 2023
Momchil Hardalov, Pepa Atanasova, Todor Mihaylov, Galia Angelova, Kiril Simov, Petya Osenova, Ves Stoyanov, Ivan Koychev, Preslav Nakov, Dragomir Radev
We present bgGLUE (Bulgarian General Language Understanding Evaluation), a
benchmark for evaluating language models on Natural Language Understanding
(NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety
of NLP problems (e.g., natural language inference, fact-checking, named entity
recognition, sentiment analysis, question answering, etc.) and machine learning
tasks (sequence labeling, document-level classification, and regression). We
run the first systematic evaluation of pre-trained language models for
Bulgarian, comparing and contrasting results across the nine tasks in the
benchmark. The evaluation results show strong performance on sequence labeling
tasks, but there is a lot of room for improvement for tasks that require more
complex reasoning. We make bgGLUE publicly available together with the
fine-tuning and the evaluation code, as well as a public leaderboard at
https://bgglue.github.io/, and we hope that it will enable further advancements
in developing NLU models for Bulgarian.
comment: Accepted to ACL 2023 (Main Conference)
♻ ☆ Learning Multi-Step Reasoning by Solving Arithmetic Tasks ACL 2023
Mathematical reasoning is regarded as a necessary ability for Language Models
(LMs). Recent works demonstrate large LMs' impressive performance in solving
math problems. The success is attributed to their Chain-of-Thought (CoT)
reasoning abilities, i.e., the ability to decompose complex questions into
step-by-step reasoning chains, but such ability seems only to emerge from
models with abundant parameters. This work investigates how to endow
relatively small LMs with multi-step reasoning capabilities. We propose
to inject such abilities by continually pre-training LMs on a synthetic dataset
MsAT which is composed of Multi-step Arithmetic Tasks. Our experiments on four
math word problem datasets show the effectiveness of the proposed method in
enhancing LMs' math reasoning abilities.
comment: ACL 2023. Code and data are available at
https://github.com/TianduoWang/MsAT
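Synthesizing a multi-step arithmetic task, a question built from chained operations plus the step-by-step chain a model is pre-trained to reproduce, can be sketched as below. The format is illustrative only; the real dataset lives at the repository above.

```python
import random

def make_msat_example(rng, steps=3, lo=1, hi=9):
    """Generate one multi-step arithmetic task: a nested expression, the
    intermediate value after each step, and the final answer. Pre-training
    on such chains is meant to teach step-by-step decomposition."""
    value = rng.randint(lo, hi)
    expr, chain = str(value), []
    for _ in range(steps):
        op, operand = rng.choice("+-"), rng.randint(lo, hi)
        value = value + operand if op == "+" else value - operand
        expr = f"({expr} {op} {operand})"
        chain.append(f"= {value}")   # intermediate result after this step
    return {"question": f"{expr} = ?", "steps": chain, "answer": value}

ex = make_msat_example(random.Random(0), steps=3)
print(ex["question"], "->", ex["answer"])
```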
♻ ☆ DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions ACL 2023
Modern machine learning relies on datasets to develop and validate research
ideas. Given the growth of publicly available data, finding the right dataset
to use is increasingly difficult. Any research question imposes explicit and
implicit constraints on how well a given dataset will enable researchers to
answer this question, such as dataset size, modality, and domain. We
operationalize the task of recommending datasets given a short natural language
description of a research idea, to help people find relevant datasets for their
needs. Dataset recommendation poses unique challenges as an information
retrieval problem; datasets are hard to directly index for search and there are
no corpora readily available for this task. To facilitate this task, we build
the DataFinder Dataset which consists of a larger automatically-constructed
training set (17.5K queries) and a smaller expert-annotated evaluation set (392
queries). Using this data, we compare various information retrieval algorithms
on our test set and present a superior bi-encoder retriever for text-based
dataset recommendation. This system, trained on the DataFinder Dataset, finds
more relevant search results than existing third-party dataset search engines.
To encourage progress on dataset recommendation, we release our dataset and
models to the public.
comment: To appear at ACL 2023. Code published at
https://github.com/viswavi/datafinder
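A bi-encoder retriever of the kind the paper trains encodes the research-idea query and every dataset description independently, then ranks by similarity, so descriptions can be embedded once and indexed offline. The bag-of-words encoder below is a toy stand-in for the trained encoder, and the dataset descriptions are invented.

```python
from collections import Counter
import math

def bow_embed(text):
    """Toy bag-of-words encoder standing in for a trained bi-encoder."""
    return Counter(text.lower().split())

def rank_datasets(query, descriptions):
    """Bi-encoder retrieval sketch: encode query and descriptions
    independently, then rank by cosine similarity; there is no joint
    query-document encoding at search time."""
    def cos(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)
    q = bow_embed(query)
    return sorted(descriptions, key=lambda d: -cos(q, bow_embed(d)))

datasets = [
    "SQuAD: question answering over Wikipedia paragraphs",
    "LibriSpeech: read English speech for ASR",
]
print(rank_datasets("question answering dataset for reading comprehension", datasets)[0])
```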
♻ ☆ MACSum: Controllable Summarization with Mixed Attributes ACL 2023
Yusen Zhang, Yang Liu, Ziyi Yang, Yuwei Fang, Yulong Chen, Dragomir Radev, Chenguang Zhu, Michael Zeng, Rui Zhang
Controllable summarization allows users to generate customized summaries with
specified attributes. However, due to the lack of designated annotations of
controlled summaries, existing works have to craft pseudo datasets by adapting
generic summarization benchmarks. Furthermore, most research focuses on
controlling single attributes individually (e.g., a short summary or a highly
abstractive summary) rather than controlling a mix of attributes together
(e.g., a short and highly abstractive summary). In this paper, we propose
MACSum, the first human-annotated summarization dataset for controlling mixed
attributes. It contains source texts from two domains, news articles and
dialogues, with human-annotated summaries controlled by five designed
attributes (Length, Extractiveness, Specificity, Topic, and Speaker). We
propose two simple and effective parameter-efficient approaches for the new
task of mixed controllable summarization based on hard prompt tuning and soft
prefix tuning. Results and analysis demonstrate that hard prompt models yield
the best performance on all metrics and human evaluations. However,
mixed-attribute control is still challenging for summarization tasks. Our
dataset and code are available at https://github.com/psunlpgroup/MACSum.
comment: TACL 2023
♻ ☆ ChatGPT Informed Graph Neural Network for Stock Movement Prediction
ChatGPT has demonstrated remarkable capabilities across various natural
language processing (NLP) tasks. However, its potential for inferring dynamic
network structures from temporal textual data, specifically financial news,
remains an unexplored frontier. In this research, we introduce a novel
framework that leverages ChatGPT's graph inference capabilities to enhance
Graph Neural Networks (GNN). Our framework adeptly extracts evolving network
structures from textual data, and incorporates these networks into graph neural
networks for subsequent predictive tasks. The experimental results from stock
movement forecasting indicate our model has consistently outperformed the
state-of-the-art Deep Learning-based benchmarks. Furthermore, the portfolios
constructed based on our model's outputs demonstrate higher annualized
cumulative returns, alongside reduced volatility and maximum drawdown. This
superior performance highlights the potential of ChatGPT for text-based network
inferences and underscores its promising implications for the financial sector.
comment: Under Review. 10 pages, 2 figures
♻ ☆ Deductive Verification of Chain-of-Thought Reasoning
Large Language Models (LLMs) significantly benefit from Chain-of-Thought
(CoT) prompting in performing various reasoning tasks. While CoT allows models
to produce more comprehensive reasoning processes, its emphasis on intermediate
reasoning steps can inadvertently introduce hallucinations and accumulated
errors, thereby limiting models' ability to solve complex reasoning tasks.
Inspired by how humans engage in careful and meticulous deductive logical
reasoning processes to solve tasks, we seek to enable language models to
perform explicit and rigorous deductive reasoning, and also ensure the
trustworthiness of their reasoning process through self-verification. However,
directly verifying the validity of an entire deductive reasoning process is
challenging, even with advanced models like ChatGPT. In light of this, we
propose to decompose a reasoning verification process into a series of
step-by-step subprocesses, each receiving only its necessary context and
premises. To facilitate this procedure, we propose Natural Program, a natural
language-based deductive reasoning format. Our approach enables models to
generate precise reasoning steps where subsequent steps are more rigorously
grounded on prior steps. It also empowers language models to carry out
reasoning self-verification in a step-by-step manner. By integrating this
verification process into each deductive reasoning stage, we significantly
enhance the rigor and trustworthiness of generated reasoning steps. Along this
process, we also improve the answer correctness on complex reasoning tasks.
Code will be released at https://github.com/lz1oceani/verify_cot.
♻ ☆ Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
We introduce Inference-Time Intervention (ITI), a technique designed to
enhance the truthfulness of large language models (LLMs). ITI operates by
shifting model activations during inference, following a set of directions
across a limited number of attention heads. This intervention significantly
improves the performance of LLaMA models on the TruthfulQA benchmark. On an
instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from
32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and
demonstrate how to balance it by tuning the intervention strength. ITI is
minimally invasive and computationally inexpensive. Moreover, the technique is
data efficient: while approaches like RLHF require extensive annotations, ITI
locates truthful directions using only a few hundred examples. Our findings
suggest that LLMs may have an internal representation of the likelihood of
something being true, even as they produce falsehoods on the surface.
comment: code: https://github.com/likenneth/honest_llama
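The intervention itself reduces to shifting the outputs of a selected set of attention heads along precomputed directions, scaled by a strength parameter. The head indices and directions below are random stand-ins for the ones ITI finds by probing; see the linked repository for the actual method.

```python
import numpy as np

def intervene(head_outputs, directions, alpha=1.0):
    """Inference-time intervention sketch: during decoding, shift the output
    of selected attention heads along precomputed "truthful" directions,
    scaled by an intervention strength alpha that trades truthfulness
    against helpfulness. Heads without a direction are left untouched."""
    out = dict(head_outputs)
    for head, d in directions.items():
        d = d / np.linalg.norm(d)           # unit direction
        out[head] = out[head] + alpha * d   # shift the activation
    return out

rng = np.random.default_rng(0)
heads = {(10, 3): rng.standard_normal(64),  # (layer, head) -> activation
         (12, 7): rng.standard_normal(64)}
dirs = {(10, 3): rng.standard_normal(64)}   # hypothetical probed direction
shifted = intervene(heads, dirs, alpha=15.0)
moved = np.linalg.norm(shifted[(10, 3)] - heads[(10, 3)])
print(round(moved, 1))  # 15.0: the shift has exactly norm alpha
```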